
Week 5 Chapter 6

Uploaded by

muneebmalik5527
Copyright
© © All Rights Reserved

REQUIREMENTS AS THE DRIVING FORCE FOR DATA WAREHOUSING


Week #05
Lecture # 10
Slide Elements
• Business Requirements as the driving force for
• Data Design
• Architectural Plan
• Data Storage Specifications
• Information Delivery Strategy
BUSINESS REQUIREMENTS
AS THE DRIVING FORCE
Business Requirements – Driving Force
• In a data warehouse, the business requirements of the users form the
single most powerful driving force.

• Every task performed in every phase of the development of the data
warehouse is determined by the requirements.

• Every decision made during the design phase, whether it concerns the
data design, the design of the architecture, the configuration of the
infrastructure, or the scheme of the information delivery methods, is
wholly influenced by the requirements.

• Your requirements definition document will drive every phase of the
project, so pay special attention to it.
Business Requirements – Driving Force (Cont.)
• Figure depicts this fundamental principle.
Data Design - 1
• In the data design phase, you come up with the data model for the
following data repositories:
• The staging area where you transform, cleanse, and integrate the data from the
source systems in preparation for loading into the data warehouse repository
• The data warehouse repository itself.
• These data models will form the blueprint for the physical design and
implementation of the data repositories.
• To understand the impact of requirements on data design, imagine the
data model as a pyramid of data contents, as shown in the figure on the
next slide.
• The base of the pyramid represents the data model for the enterprise-wide
data repository, and the top half of the pyramid denotes the dimensional
data model for the data marts.
• What do you need in the requirements definition to build and meld the two
halves of the pyramid? Two basic pieces of information are needed:
the source system data models and the information package diagrams.
Data Design - 2
• The data models of the current source systems will be used for the
lower half. Therefore, ensure that your requirements definition
document contains adequate information about the components and
the relationships of the source system data.
• The information package diagrams, which capture the business
dimensions and measures, serve the upper half.
Data Design - 3
What must be reflected in the data model?
• Structure for Business Dimensions
• In the data models for the data marts, the business dimensions along which the
users analyze the business metrics must be featured prominently. The usefulness of
the data mart is directly related to the accuracy of the data model.
• Structure for Key Measurements
• Key measurements are the metrics or measures that are used for business analysis
and monitoring. Users measure performance by using and comparing key
measurements.
• Levels of Detail
• The data model must include structures to hold detailed as well as summary
data. Find out about the essential drill-down and roll-up functions, and
include enough particulars about the types of summary and detail levels of
data your data warehouse must hold.
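A minimal sketch of how these three elements (business dimensions, key measurements, and levels of detail) might appear in a data mart model. The retail tables, column names, and sample rows are hypothetical and not taken from the lecture; SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical retail data mart: all table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, product TEXT, category TEXT);
-- The fact table holds the key measurements at the most detailed level.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold INTEGER,
    sales_amount REAL);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)", [
    (1, "2024-01-15", "2024-01", 2024),
    (2, "2024-01-16", "2024-01", 2024),
    (3, "2024-02-01", "2024-02", 2024),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (10, "Laptop", "Computers"),
    (11, "Mouse", "Accessories"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 10, 2, 2000.0),
    (2, 11, 5, 100.0),
    (3, 10, 1, 1000.0),
])

def sales_by(level):
    """Roll sales_amount up to a date level: 'day', 'month', or 'year'."""
    if level not in {"day", "month", "year"}:
        raise ValueError(f"unknown level: {level}")
    query = (
        f"SELECT d.{level}, SUM(f.sales_amount) "
        "FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key "
        f"GROUP BY d.{level} ORDER BY d.{level}"
    )
    return conn.execute(query).fetchall()

monthly = sales_by("month")   # roll-up from day to month
yearly = sales_by("year")     # roll-up from month to year
daily = sales_by("day")       # drill-down to the detail level
```

Here the dimension tables carry the hierarchy (day, month, year), so the same fact rows answer both summary and detail questions.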
Architectural Plan - 1
• Data warehouse architecture refers to the proper arrangement of the
architectural components for maximum benefit.
• Let us recap the major architectural components:
• Source data (Production data, Internal data, Archived data, External data)
• Data staging (Data extraction, Data transformation, Data loading)
• Data storage
• Information delivery
• Metadata
• Management and control
• When you plan the overall architecture for your data warehouse, you
will be setting the scope and contents of each of these components.
• Planning the architecture involves reviewing each of the components
in the light of your particular context and setting the parameters. It
also involves defining the interfaces among the various components.
• All the information you need to plan the architecture must come from
the requirements definition.
Source Data
• Operational Source Systems: Systems that manage day-to-day
business operations, such as CRM, ERP, and transaction systems.
• Computing Platforms, Operating Systems, Databases, Files: The
infrastructure that runs applications and stores data, including
Windows, Linux, Oracle, SQL Server, NoSQL, and file systems.
• Departmental Data (Files, Documents, Spreadsheets): Data
generated within business departments, often stored in formats like
Excel, Word, PDFs, and shared via local or cloud storage.
• External Data Sources: Third-party data such as market data,
weather reports, social media feeds, and data from partners or
vendors. These are usually obtained via APIs, data feeds, or web
scraping.
Data Staging
• Data Mapping: The process of defining relationships between data
fields in different source systems and corresponding fields in the
staging area to ensure accurate transfer.
• Data Transformations: Altering or converting data formats,
structures, or values to meet the requirements of the target system or
improve compatibility between systems.
• Data Cleansing: Identifying and correcting errors, inconsistencies,
duplicates, or missing data to ensure quality and reliability of the data
before further use.
• Data Integration: Combining data from multiple sources into a
unified view or database, ensuring consistent and accurate
information across systems for analysis and reporting.
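The four staging steps above can be sketched as a small pipeline. This is an illustration under stated assumptions: the two source systems ("crm" and "erp"), their field names, the country lookup, and the sample rows are all hypothetical.

```python
# Hypothetical mapping from source-system field names to staging-area fields.
FIELD_MAP = {
    "crm": {"cust_id": "customer_id", "cust_name": "name", "ctry": "country"},
    "erp": {"CustomerNo": "customer_id", "FullName": "name", "Country": "country"},
}
# Assumed lookup used to convert free-text countries to a standard code.
COUNTRY_CODES = {"usa": "US", "united states": "US", "pakistan": "PK"}

def map_fields(record, source):
    """Data mapping: rename source fields to their staging-area names."""
    return {target: record.get(src) for src, target in FIELD_MAP[source].items()}

def transform(record):
    """Data transformation: convert types and values to the target format."""
    out = dict(record)
    out["customer_id"] = str(out["customer_id"]).strip()
    country = (out.get("country") or "").strip().lower()
    out["country"] = COUNTRY_CODES.get(country)  # None when unknown or missing
    return out

def cleanse(record):
    """Data cleansing: normalize whitespace and capitalization in names."""
    out = dict(record)
    name = " ".join((out.get("name") or "").split())
    out["name"] = name.title() or None
    return out

def integrate(records):
    """Data integration: merge records per customer, filling gaps only."""
    merged = {}
    for rec in records:
        current = merged.setdefault(rec["customer_id"], {})
        for key, value in rec.items():
            if current.get(key) is None:
                current[key] = value
    return list(merged.values())

crm_rows = [{"cust_id": 101, "cust_name": "  ali   khan ", "ctry": "Pakistan"}]
erp_rows = [
    {"CustomerNo": "101", "FullName": "Ali Khan", "Country": None},
    {"CustomerNo": "102", "FullName": "SARA AHMED", "Country": "USA"},
]

staged = [cleanse(transform(map_fields(r, "crm"))) for r in crm_rows]
staged += [cleanse(transform(map_fields(r, "erp"))) for r in erp_rows]
customers = integrate(staged)
```

Customer 101 appears in both systems and ends up as a single record whose country comes from the CRM row, since the ERP row left it empty.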
Data Storage
• Extracted and Integrated Data Size: Refers to the volume of data
stored after extraction from source systems and integration, typically
in gigabytes, terabytes, or more.
• DBMS Features: Capabilities of the database management system,
such as scalability, indexing, security, concurrency control, and
backup/recovery options.
• Growth Potential: The ability of the storage system to handle
increasing amounts of data over time, considering both hardware and
software scalability.
• Centralized or Distributed: Describes whether the data storage is
managed in a single centralized location (easier control) or spread
across multiple locations (improved redundancy and access speed).
Information Delivery
• Types and Number of Users:
Executive Users: Use high-level summaries, dashboards.
Operational Users: Access detailed reports, day-to-day data.
Analysts: Explore data for trends, predictions.
Number of users: Varies from a few decision-makers to large teams across
departments.
• Types of Queries and Reports:
Ad-hoc Queries: On-demand, specific data requests.
Standard Reports: Pre-defined, routine reports (e.g., daily sales).
Drill-down Reports: Let users explore data at multiple levels of detail.
• Classes of Analysis:
Descriptive Analysis: What happened? (Summarizing data).
Predictive Analysis: What might happen? (Using past data to forecast).
Prescriptive Analysis: What should be done? (Suggesting actions based on data).
• Front-end DSS Applications:
Dashboards: Visual summaries for executives.
OLAP Tools: Online Analytical Processing for multidimensional data exploration.
Data Visualization Tools: Presenting data in charts, graphs, etc., for intuitive
understanding.
Metadata
1. Operational Metadata
• Information about the processes and systems that manage data (e.g., job run
times, logs, system performance).
• Used to monitor and manage ETL jobs and other operational processes.
2. ETL (Data Extraction/Transformation/Loading) Metadata
• Describes the data's journey from source to destination (e.g., data sources,
transformations applied, data loading steps).
• Ensures transparency and traceability in the ETL process.
3. End-user Metadata
• Describes data from a user’s perspective (e.g., data definitions, business terms,
reports).
• Helps users understand and work with data effectively.
4. Metadata Storage
• Refers to where and how metadata is stored (e.g., databases, metadata
repositories).
• Ensures metadata is easily accessible for both operational and analytical
purposes.
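One possible way to picture the metadata categories above is as record types held in a single repository. The class names, fields, and sample values below are assumptions made for illustration, not a standard metadata schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class EtlStep:
    """ETL metadata: one hop in the data's journey from source to target."""
    source: str
    target: str
    transformation: str

@dataclass
class JobRun:
    """Operational metadata: facts about one execution of a load job."""
    job_name: str
    started_at: datetime
    finished_at: datetime
    rows_loaded: int
    status: str = "success"

    def duration_seconds(self) -> float:
        return (self.finished_at - self.started_at).total_seconds()

@dataclass
class MetadataRepository:
    """Metadata storage: one place that holds all three kinds of metadata."""
    etl_steps: list = field(default_factory=list)
    job_runs: list = field(default_factory=list)
    glossary: dict = field(default_factory=dict)  # end-user metadata

    def lineage(self, target: str) -> list:
        """Return the ETL steps that feed the given target field."""
        return [step for step in self.etl_steps if step.target == target]

repo = MetadataRepository()
repo.etl_steps.append(
    EtlStep("crm.cust_name", "dw.customer.name", "trim and title-case"))
repo.job_runs.append(
    JobRun("load_customer",
           started_at=datetime(2024, 3, 10, 2, 0),
           finished_at=datetime(2024, 3, 10, 2, 5),
           rows_loaded=1200))
repo.glossary["customer"] = "A person or organization that has placed an order."
```

A call such as `repo.lineage("dw.customer.name")` then answers the traceability question: which source fields and transformations produced this warehouse field.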
Management and Control
1. Data Loading (Management and Control)
• Involves managing how data is transferred from source to target
systems, ensuring data integrity, accuracy, and completeness.
2. External Sources
• Refers to handling and integrating data from external systems or
third-party sources, ensuring proper validation and transformation
for consistency.
3. Alert Systems
• Monitors data processes and generates real-time alerts for
failures, anomalies, or performance issues in data pipelines.
4. End-user Information Delivery
• Ensures that processed data is accurately and efficiently delivered
to end-users, often through dashboards, reports, or other
accessible formats.
Architectural Plan - 3
Special Considerations
When you are in the requirements definition phase, you have to pay
special attention to these factors:
1. Data Extraction/Transformation/Loading (ETL)
• Data Extraction: Clearly identify all the internal data sources. Specify all
the computing platforms and source files from which the data is to be
extracted. If you are going to include external data sources, determine the
compatibility of your data structures with those of the outside sources. Also
indicate the methods for data extraction.
• Data Transformation: Many types of transformation functions are needed
before data can be mapped and prepared for loading into the data
warehouse repository. These functions include input selection, separation of
input structures, normalization and de-normalization of source structures,
aggregation, conversion, resolution of missing values, and conversion of
names and addresses.
• Data Loading: Define the initial load. Determine how often each major
group of data must be refreshed to stay up to date in the data warehouse.
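The data loading requirement (an initial load plus a refresh frequency per major data group) can be written down as a simple schedule check. The data groups, intervals, and timestamps here are hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta

# Hypothetical refresh requirements gathered during requirements definition.
REFRESH_INTERVAL = {
    "sales": timedelta(days=1),
    "inventory": timedelta(hours=4),
    "customer": timedelta(weeks=1),
}

def groups_due(last_loaded, now):
    """Return data groups that need an initial load or are due for refresh."""
    return sorted(
        group
        for group, interval in REFRESH_INTERVAL.items()
        if group not in last_loaded or now - last_loaded[group] >= interval
    )

now = datetime(2024, 3, 10, 12, 0)
last_loaded = {
    "sales": datetime(2024, 3, 9, 11, 0),      # 25 hours ago, so due
    "inventory": datetime(2024, 3, 10, 9, 0),  # 3 hours ago, so not due
    # "customer" has never been loaded, so it needs its initial load
}
due = groups_due(last_loaded, now)
```

A group that has never been loaded is treated as due, which covers the initial load with the same rule as the periodic refresh.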
Architectural Plan - 4
2. Data Quality
• Bad data leads to bad decisions. Therefore, right in the early phase of
requirements definition, identify potential sources of data pollution and be
aware of all the possible types of data quality problems likely to be
encountered in your operational systems.
• Data Pollution Sources
System conversions and migrations | Heterogeneous systems integration |
Inadequate database design of source systems | Data aging | Incomplete
information from customers | Input errors | Internationalization/localization of
systems | Lack of data management policies/procedures
• Types of Data Quality Problems
Dummy values in source system fields | Absence of data in source system
fields | Multipurpose fields | Cryptic data | Contradicting data | Improper use of
name and address lines | Violation of business rules | Reused primary keys |
Non-unique identifiers
Data Pollution Sources
1. System Conversions and Migrations
Data errors or inconsistencies during system upgrades or transfers from one platform to another.
2. Heterogeneous Systems Integration
Combining data from diverse systems with different formats and structures, leading to inconsistencies.
3. Inadequate Database Design of Source Systems
Poorly structured databases can result in data redundancy, loss, or inaccuracies.
4. Data Aging
Over time, data becomes outdated, irrelevant, or less accurate, leading to poor data quality.
5. Incomplete Information from Customers
Missing or partial data input by customers can result in gaps and inaccurate analysis.
6. Input Errors
Human mistakes during data entry can lead to incorrect or incomplete data being stored.
7. Internationalization/Localization of Systems
Challenges in managing different languages, currencies, formats, or regional standards can cause data
inconsistencies.
8. Lack of Data Management Policies/Procedures
Absence of clear guidelines for data governance can lead to unregulated, inconsistent, or poor-quality
data management.
Types of Data Quality Problems
1. Dummy Values in Source System Fields
• Placeholder values (e.g., “N/A” or “0000”) used instead of real data, which can mislead analysis.
2. Absence of Data in Source System Fields
• Missing values in critical fields, leading to incomplete datasets and unreliable insights.
3. Multipurpose Fields
• Using a single field for multiple data types or purposes, causing confusion and inconsistent data interpretation.
4. Cryptic Data
• Data that is unclear or not easily understood (e.g., codes or abbreviations), complicating analysis and
decision-making.
5. Contradicting Data
• Conflicting information in different records or systems, leading to confusion and trust issues in data reliability.
6. Improper Use of Name and Address Lines
• Inconsistent formatting or incorrect segmentation of name and address fields, causing errors in
communication and delivery.
7. Violation of Business Rules
• Data that does not comply with predefined business logic or rules, potentially leading to operational issues or
analysis errors.
8. Reused Primary Keys
• Reassigning a primary key that once identified a different record (for example, after the old record was
purged) makes historical data ambiguous and causes data integrity issues in relational databases.
9. Non-unique Identifiers
• Duplicate identifiers that should be unique, leading to ambiguity in data identification and retrieval.
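Three of the problems listed above (dummy values, absent data, and non-unique identifiers) lend themselves to automated profiling. The following is a rough sketch; the set of placeholder values and the sample records are assumptions, and a real source system would need its own list.

```python
from collections import Counter

# Assumed placeholder values; each source system will have its own.
DUMMY_VALUES = {"N/A", "0000", "UNKNOWN", "XXX"}

def profile(records, key_field, required_fields):
    """Report dummy values, missing values, and duplicated identifiers."""
    issues = {"dummy_values": [], "missing_values": [], "non_unique_ids": []}
    for index, rec in enumerate(records):
        for name in required_fields:
            value = rec.get(name)
            if value is None or str(value).strip() == "":
                issues["missing_values"].append((index, name))
            elif str(value).strip().upper() in DUMMY_VALUES:
                issues["dummy_values"].append((index, name))
    counts = Counter(rec.get(key_field) for rec in records)
    issues["non_unique_ids"] = sorted(
        key for key, n in counts.items() if key is not None and n > 1)
    return issues

records = [
    {"customer_id": "C1", "name": "Ali Khan", "zip": "54000"},
    {"customer_id": "C2", "name": "N/A", "zip": "0000"},
    {"customer_id": "C2", "name": "Sara Ahmed", "zip": ""},
]
report = profile(records, "customer_id", ["name", "zip"])
```

Each issue is reported as a (row index, field name) pair so it can be traced back to the source record. Problems such as cryptic or contradicting data usually need business knowledge and are not covered by this sketch.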
Architectural Plan - 5
3. Metadata
• Metadata acts as the glue that ties all the components together.
• Earlier, we categorized the metadata in a data warehouse into three
groups: operational, data extraction and transformation, and end-user.
Architectural Plan - 6
Tools and Products
In general, tools are available for the following functions:
• Data Extraction and Transformation
• Middleware
• Data extraction
• Data transformation
• Data quality assurance
• Load image creation

• Warehouse Storage
• Data marts
• Metadata

• Information Access/Delivery
• Report writers
• Query processors
• OLAP
• Alert systems
• DSS applications
• Data mining
Data Storage Specifications - 1
• If your company is adopting the top-down approach of developing
the data warehouse, then you have to define the storage
specifications for
• The data staging area
• The overall corporate data warehouse
• Each of the dependent data marts, beginning with the first
• Any multidimensional databases for OLAP
• Alternatively, if your company opts for the bottom-up approach, you
need specifications for
• The data staging area
• Each of the conformed data marts, beginning with the first
• Any multidimensional databases for OLAP
• Typically, the overall corporate data warehouse will be based on the
relational model supported by a relational database management
system (RDBMS).
Data Storage Specifications - 2
• Whatever your choice of the database management system may be,
that system will have to interact with back-end and front-end tools.
• The back-end tools are the products for data transformation, data
cleansing, and data loading. The front-end tools relate to information
delivery to the users.
• While defining requirements, bear in mind their influence on data
storage specifications and collect all the necessary details about the
back-end and the front-end architectural components.
DBMS Selection:
• The following elements of business requirements affect the choice of
the DBMS:
1. Level of user experience. If the users are totally inexperienced with
database systems, the DBMS must have features to monitor and control
runaway queries. The DBMS must also support an easy SQL-type language
interface.
Data Storage Specifications - 3
2. Types of queries. The DBMS must have a powerful optimizer if most of
the queries are complex and produce large result sets.
3. Need for openness. The degree of openness depends on the back-end
and front-end architectural components and those, in turn, depend on the
business requirements.
4. Data loads. The data volumes and load frequencies determine the
strengths the DBMS needs in the areas of data loading, recovery, and restart.
5. Metadata management. Let your requirements definition reflect the type
and extent of the metadata framework.
6. Data repository locations. Is your data warehouse going to reside in one
central location, or is it going to be distributed? The answer to this question
will establish whether the selected DBMS must support distributed
databases.
7. Data warehouse growth. Your business requirements definition must
contain information on the estimated growth in the number of users, and in
the number and complexity of queries. The growth estimates will have a
direct relation to how the selected DBMS supports scalability.
Data Storage Specifications - 4

Storage Sizing
• You need to estimate the storage sizes for the following in the
requirements definition phase:
• Data staging area. Calculate storage estimates for data staging area
from sizes of source system data structures for each business subject.
• Overall corporate data warehouse. For each business subject, list
the various attributes, estimate their field lengths, and arrive at the
calculation for the storage needed for that subject.
• Data marts, dependent or conformed. Use the details of the
business dimensions and business measures found in the information
package diagrams to estimate the storage size for the data marts.
• Multidimensional databases. Work out the details of the OLAP planned
for your users and then use those details to estimate storage for these
multidimensional databases.
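The sizing method described above (list the attributes, estimate their field lengths, multiply by the expected rows) is simple arithmetic. A rough sketch follows; the field lengths, row count, and the 1.5 overhead factor for indexes and free space are illustrative assumptions, not figures from the lecture.

```python
def estimate_bytes(field_lengths, row_count, overhead_factor=1.0):
    """Estimate storage as (sum of field lengths) x rows x overhead."""
    row_bytes = sum(field_lengths.values())
    return int(row_bytes * row_count * overhead_factor)

# Hypothetical sales subject: attribute names and byte lengths are assumed.
sales_fields = {
    "date_key": 4,
    "product_key": 4,
    "store_key": 4,
    "units_sold": 4,
    "sales_amount": 8,
}

raw = estimate_bytes(sales_fields, row_count=1_000_000)
with_overhead = estimate_bytes(sales_fields, row_count=1_000_000,
                               overhead_factor=1.5)
print(f"raw: {raw / 1e6:.0f} MB, with overhead: {with_overhead / 1e6:.0f} MB")
```

Repeating the calculation for each business subject, and then for the staging area and each data mart, gives the totals the requirements definition should record.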
Information Delivery Strategy - 1
• The impact of business requirements on the information delivery
mechanism in a data warehouse is straightforward.
• During the requirements definition phase, users tell you what
information they want to retrieve from the data warehouse.
• You record these requirements in the requirements definition
document.
• You then provide all the desired features and content in the
information delivery component.
• Although the impact appears to be straightforward and simple, there
are several issues to be considered.
• Many different aspects of the requirements impact various elements of
the information delivery component in different ways.
Information Delivery Strategy - 2
• The broad areas of the information delivery component directly
impacted by business requirements are:
• Queries and reports
• Types of analysis
• Information distribution
• Decision support applications
• Growth and expansion

1. Queries and reports
• Find out who will be using predefined queries and preformatted reports.
Get the specifications.
• Also, get the specifications for the production and distribution frequency
for the reports.
Information Delivery Strategy - 3
2. Types of analysis
Most of today’s data warehouse environments equip users with OLAP.
Using the OLAP facilities, users can perform multidimensional analysis and
obtain multiple views of the data from multidimensional databases. This
type of analysis is called slicing and dicing. Estimate the nature and extent
of the drill-down and roll-up facilities to be provided. Determine how
much slicing and dicing has to be made available.
3. Information distribution
• Where are your users? Are they in one location? Are they in one local site
connected by a local area network (LAN)? Are they spread out on a wide
area network (WAN)? These factors determine how information must be
distributed to your users. Clearly indicate these details in the requirements
definition.
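To make the terms under "Types of analysis" concrete, here is a small sketch of slicing, dicing, roll-up, and drill-down over an in-memory set of sales rows. The data and the helper names (`slice_rows`, `dice_rows`, `aggregate`) are hypothetical; an actual OLAP tool would perform these operations against a multidimensional database.

```python
from collections import defaultdict

# Hypothetical sales rows with three dimensions and one measure.
sales = [
    {"quarter": "Q1", "region": "North", "product": "Laptop", "amount": 2000},
    {"quarter": "Q1", "region": "South", "product": "Mouse", "amount": 100},
    {"quarter": "Q2", "region": "North", "product": "Mouse", "amount": 150},
    {"quarter": "Q2", "region": "South", "product": "Laptop", "amount": 1000},
]

def slice_rows(rows, **fixed):
    """Slice: fix one or more dimensions to a single value each."""
    return [r for r in rows if all(r[d] == v for d, v in fixed.items())]

def dice_rows(rows, **allowed):
    """Dice: keep rows whose dimension values fall in the given sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

def aggregate(rows, *dims):
    """Sum the measure grouped by the given dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["amount"]
    return dict(totals)

by_quarter = aggregate(sales, "quarter")                       # roll-up
by_quarter_region = aggregate(sales, "quarter", "region")      # drill-down
north_by_product = aggregate(slice_rows(sales, region="North"), "product")
q1_laptops = dice_rows(sales, quarter={"Q1"}, product={"Laptop"})
```

Grouping by fewer dimensions rolls the data up; adding a dimension drills down. The number of dimension combinations users need is what determines how much slicing and dicing must be supported.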
Information Delivery Strategy - 4
4. Decision support applications
• These are specialized applications designed to support individual groups of
users for specific purposes. An executive information system provides decision
support to senior executives. A data mining application is a special-purpose
system to discover new patterns of relationships and predictive possibilities.
5. Growth and expansion
• The information delivery component continues to grow and expand: the number and
complexity of queries and reports increase, and each part of the component is enhanced
over time. Anticipate this growth in your original requirements definition, and collect
enough details to estimate the growth and enhancements, because they influence the
proper design of the information delivery component.
