UNIVERSITY
JNANA SANGAMA, BELGAVI-590018, KARNATAKA
DATA WAREHOUSING
(As per CBCS Scheme 2022)
PREPARED BY:
INDHUMATHI R (ASST.PROF DEPT OF DS (CSE), KNSIT)
MODULE 3
Chapter 7: Architectural Components
1. Data Acquisition
2. Data Storage
3. Information Delivery
1. Data Acquisition
Source Data: This is the raw data coming from different sources. It could be from operational systems, legacy systems, ERP systems, and external sources.
Page | 2
DEPT OF CSE (DS)
DATA WAREHOUSING BAD515B
2. Data Storage
3. Information Delivery
Report/Query: This component enables users to create reports and run queries on the
data warehouse. It allows them to retrieve specific information based on their
requirements.
OLAP (Online Analytical Processing): OLAP tools allow users to perform complex,
multi-dimensional analysis. With OLAP, users can perform actions like drilling down
into detailed data or rolling up for aggregated insights.
Data Mining: This involves identifying patterns and correlations in the data, often
using algorithms to predict future trends or behaviors. Data mining helps organizations
uncover insights that are not immediately visible.
Purpose: Information delivery ensures that the data stored in the warehouse is
accessible and useful to end-users, allowing for strategic decision-making.
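The drill-down and roll-up operations described above can be sketched with a few lines of Python. This is a minimal illustration, not a real OLAP engine; the region/city sales records are hypothetical data.

```python
from collections import defaultdict

# Illustrative sales records (hypothetical data): (region, city, amount)
sales = [
    ("South", "Bengaluru", 120.0),
    ("South", "Chennai", 80.0),
    ("North", "Delhi", 150.0),
    ("South", "Bengaluru", 30.0),
]

def roll_up(rows):
    """Aggregate city-level detail up to region totals."""
    totals = defaultdict(float)
    for region, _city, amount in rows:
        totals[region] += amount
    return dict(totals)

def drill_down(rows, region):
    """Return the detailed city-level rows behind one region total."""
    return [(city, amount) for r, city, amount in rows if r == region]

print(roll_up(sales))              # {'South': 230.0, 'North': 150.0}
print(drill_down(sales, "South"))  # city-level detail behind the South total
```

Rolling up collapses a detailed dimension (city) into a coarser one (region); drilling down reverses the view by exposing the rows behind an aggregate.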
Management and Control: This component oversees the data warehouse's operations. It ensures that the data flow
between components is efficient, monitors data quality, and manages access to various
parts of the system.
2. Distinguishing Characteristics
Data Content
Data in a warehouse is mostly read-only, meaning it’s not regularly modified but is
stored for analysis. It’s integrated from multiple sources and represents historical data
rather than real-time, current data. This enables a complete view of long-term trends.
Complex Analysis and Quick Response:
Data warehouses support complex, interactive analysis. Users can perform tasks like
“drilling down” to view detailed data, “rolling up” to see aggregated data, and “slicing
and dicing” to look at data from different perspectives. This functionality is critical for
making quick, strategic decisions.
Large Data Volumes:
Since data warehouses store years of historical data, they need to handle very high
volumes efficiently. This is particularly important in large organizations where the data
generated is vast.
Flexible and Dynamic:
Data warehouses need to be flexible and adaptable. As business needs evolve, new
requirements may emerge, so the architecture must allow for easy updates and
adjustments.
Metadata-Driven:
Metadata plays a crucial role in managing and understanding the data warehouse.
3. Architectural Framework
4. Technical Architecture
The example also indirectly refers to the technical architecture by showing the set of
functions (e.g., data extraction, transformation, storage, and retrieval) and services
provided within each architectural component.
Although specific tools weren’t named, the scenario emphasizes that the architecture
is designed first, then tools are selected based on the complexity and scope of the
platform's needs (e.g., whether sophisticated extraction tools are necessary based on the
data sources' variety).
Context:
ShopifyPlus is an online retail platform that sells a wide variety of products, including
electronics, apparel, and home goods. It operates in multiple countries and has millions of
users visiting its site daily. ShopifyPlus wants to leverage a data warehouse to provide
insights for decision-making, customer behavior analysis, inventory management, sales
forecasting, and marketing campaigns.
1. Data Acquisition
Overview:
Data acquisition involves extracting data from various sources, processing it, and loading it
into the data warehouse for analysis and reporting. For ShopifyPlus, data is extracted from
internal and external systems and stored in a staging area before being transformed.
Key Components:
Data Sources:
ShopifyPlus gathers data from:
o Operational systems, such as transactional databases (e.g., order management
system, inventory management).
o Legacy systems, which might contain archived transaction history.
o ERP (Enterprise Resource Planning) system data, which consolidates sales and
supply chain data.
o External sources, like market trend data, social media analytics, and customer
feedback.
Intermediary Data Stores:
During extraction, ShopifyPlus pulls data into temporary files for pre-processing. For
instance, they might merge data from various regional warehouses or split files by
product category before moving them into the staging area.
Staging Area:
The staging area serves as the main preparation ground where all extracted data is
cleaned, transformed, and merged. For ShopifyPlus, this area processes:
o Product and inventory data: Cleansed for duplicates and inconsistencies.
o Customer transaction data: Merged across multiple regions to ensure unique
customer profiles.
o Sales data: Aggregated by time and region for trend analysis.
Data Extraction:
ShopifyPlus applies filters to select relevant data from operational systems, like
extracting only high-value customer transactions for specific analysis. They create
intermediary files for merging similar data before moving to staging.
Data Transformation:
o Cleaning and Deduplication: Removing duplicate customer records.
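The deduplication step above can be sketched as a small Python routine. The field names (`email`, `updated`) are hypothetical; the idea is simply to keep one record per normalized key, preferring the most recent row.

```python
# Sketch of the cleaning-and-deduplication step (hypothetical field names):
# keep one record per normalized email key, preferring the most recent row.
def deduplicate(customers):
    latest = {}
    for rec in customers:
        key = rec["email"].strip().lower()  # normalize before comparing
        if key not in latest or rec["updated"] > latest[key]["updated"]:
            latest[key] = rec
    return list(latest.values())

rows = [
    {"email": "A@shop.com",  "name": "Asha",   "updated": "2024-01-10"},
    {"email": "a@shop.com ", "name": "Asha K", "updated": "2024-03-02"},
    {"email": "b@shop.com",  "name": "Bala",   "updated": "2024-02-11"},
]
clean = deduplicate(rows)
print(len(clean))  # 2 unique customers survive
```

ISO date strings compare correctly as plain strings, which keeps the sketch dependency-free; a real pipeline would parse them into datetimes.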
2. Data Storage
Overview:
Data storage involves loading the transformed data into a central repository, typically an
RDBMS (Relational Database Management System), where data can be stored in a structured
and organized way for analysis.
Key Components:
Data Repository:
ShopifyPlus’s data warehouse is organized with relational databases where data is
stored in a star schema with fact and dimension tables.
o Fact Tables: Transaction details (e.g., sales amount, quantity sold).
o Dimension Tables: Customer profiles, product categories, geographic
regions, and time.
Loading Data:
The initial data load involves a full refresh to populate tables, while incremental
updates occur daily to add new transactions and update existing records.
Data Granularity:
ShopifyPlus keeps detailed data at the transaction level for analysis. However, they
also store aggregated data for faster access when running high-level reports.
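The star schema and aggregate queries described above can be demonstrated with SQLite from the Python standard library. All table and column names here are illustrative, not ShopifyPlus's actual schema.

```python
import sqlite3

# Minimal star-schema sketch (illustrative table and column names):
# one fact table surrounded by dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    quantity   INTEGER,
    amount     REAL
);
""")
db.executemany("INSERT INTO dim_product VALUES (?, ?)",
               [(1, "Electronics"), (2, "Apparel")])
db.executemany("INSERT INTO dim_region VALUES (?, ?)",
               [(10, "South"), (20, "North")])
db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
               [(1, 10, 2, 200.0), (2, 10, 1, 40.0), (1, 20, 3, 300.0)])

# A typical warehouse query: aggregate the fact table by a dimension.
rows = db.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)  # [('Apparel', 40.0), ('Electronics', 500.0)]
```

Detail stays at transaction level in the fact table, while queries like this produce the aggregated view; precomputing such sums into summary tables is what "storing aggregated data" means in practice.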
3. Information Delivery
Overview:
The information delivery area involves making data accessible to end-users through reports,
dashboards, and OLAP (Online Analytical Processing) cubes.
Key Components:
Data Flow:
Data flows from the central warehouse to different data marts (e.g., for marketing,
finance, and sales teams) and directly to dashboards. For ShopifyPlus, key insights are
presented in real-time dashboards, ad hoc reports, and via OLAP cubes.
Service Locations:
Query services are available on user desktops, and users can directly access
dashboards for sales and inventory data. ShopifyPlus uses an application server to
generate reports automatically at regular intervals.
Data Stores for Information Delivery:
o Temporary Data Stores: Used to save query results that are frequently
accessed, such as customer segments based on purchase history.
o OLAP Cubes: OLAP cubes are built for multidimensional analysis, providing
quick access to sales by product category, region, and time.
o Dashboards and Scorecards: Real-time dashboards display current
inventory, recent sales, and revenue trends.
4. Management and Control
Overview:
This area includes the management and administration of the data warehouse, ensuring
efficient operations, data integrity, and consistent performance.
Key Components:
Metadata Management:
Metadata catalogs describe data sources, transformations, and structures in the data
warehouse, enabling traceability and governance.
Monitoring and Optimization:
ShopifyPlus monitors usage to identify and improve bottlenecks in the system. For
example, they track query execution times and usage patterns to refine indexes and
tune query performance.
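Tracking query execution times, as described above, can be sketched with a small monitoring hook. The names here are hypothetical; a real warehouse would use the DBMS's own workload-monitoring views rather than application code.

```python
import time
from contextlib import contextmanager

# Illustrative usage-monitoring hook (hypothetical names): time every
# query, keep the measurements, and flag the slow ones for index tuning.
query_stats = []  # (query_name, elapsed_seconds) pairs

@contextmanager
def timed_query(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        query_stats.append((name, time.perf_counter() - start))

def slow_queries(threshold_s=0.5):
    """Queries whose execution time suggests index or tuning work."""
    return [name for name, secs in query_stats if secs > threshold_s]

with timed_query("daily_sales_by_region"):
    time.sleep(0.01)  # stand-in for the real query execution

print(query_stats[-1][0], slow_queries())
```

The collected `(name, seconds)` pairs are exactly the usage pattern data the text mentions: administrators review them to decide which columns deserve indexes.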
Backup and Security:
Regular backups and data encryption ensure that ShopifyPlus's data is protected.
Security measures prevent unauthorized access, while audit trails record who accessed
what data.
Query Governance:
Query limits prevent system overload by restricting large or complex queries.
Event Triggers:
Triggers alert administrators when data loads are complete, allowing teams to validate
data before it's available to users.
Metadata Management:
ShopifyPlus stores metadata for every data element in the warehouse, including
source, transformation rules, and access permissions.
System Monitoring and Fine-Tuning:
Continuous monitoring helps in improving performance. ShopifyPlus periodically
archives older data to maintain system efficiency and prevent data bloat.
Audit Trails and Security Logs:
Audit trails are maintained for compliance and to track data lineage, ensuring that data
accuracy can be verified if needed.
5. Architectural Types
2. Independent Data Marts
Structure: Multiple, isolated data marts serve individual departments or purposes, with
no integration or connection between them.
Data Flow: Each data mart is built separately and often uses data relevant only to its
specific department. There’s no staging area that combines or integrates these data
marts.
End-User Access: Users in each department access business intelligence from their
respective data marts only.
Characteristics: This architecture can emerge without an overall plan, leading to data
silos, where data is stored independently within each data mart, without a unified view
of enterprise data.
3. Federated Architecture
4. Hub-and-Spoke Architecture
Structure: A centralized data warehouse acts as the main data source, with dependent
data marts that draw from this warehouse.
Data Flow: Data flows from source systems to a staging area, then to the central data
warehouse. From there, it flows to dependent data marts, which serve specific user
needs.
End-User Access: Users can access information either from the centralized data
warehouse or from the dependent data marts.
5. Data-Mart Bus Architecture
Scenario: ShopVerse operates without a distinct, single data warehouse. Instead, it has
a series of conformed data marts, each designed to serve specific business functions,
like Sales, Marketing, and Inventory.
Data Flow: Data flows from source systems into a staging area, where it is standardized
and then distributed to the conformed data marts. Common business dimensions, like
“Customer” and “Product,” are shared across these data marts.
End-User Access: End-users in various departments can access reports and analysis
based on data in the conformed marts, which are linked to provide a comprehensive
view.
Characteristics: The collection of conformed data marts effectively serves as a unified
data warehouse, allowing departments to access interconnected data marts for
enterprise-wide insights.
1. Importance of Infrastructure
Supporting Infrastructure:
While the physical infrastructure provides the technical foundation, the operational
infrastructure ensures smooth functioning. Both are crucial for data warehouse
efficiency; without the right operational support, even a well-designed physical setup
may not perform optimally.
2. Hardware Requirements
Infrastructure Considerations:
When selecting hardware for a data warehouse, it's essential to evaluate how much
existing infrastructure can be leveraged and keep a modular approach for easy upgrades.
Assess the capacity of the current systems to determine how much storage or processing
power is available or if new components are needed.
Scalability: Ensure the hardware can grow to support increased data and user demands.
Vendor Support and Stability: Opt for reliable hardware vendors that can offer robust
support.
Platform Compatibility: Choose hardware compatible with the operating system and
database software used in the warehouse.
Operating System Requirements:
Scalability: The OS must support growth in user base and query complexity.
Security, Reliability, and Availability: The OS must provide a secure, stable, and
resilient environment.
Preemptive Multitasking & Multithreading: For effective multitasking and
distribution of workloads across processors, crucial for handling multiple simultaneous
requests.
Platform Options:
Mainframes: Typically outdated and not cost-effective for data warehousing, but may
be repurposed for small data marts if there’s spare capacity.
Open System Servers (like UNIX): Common in data warehousing for their robustness
and support for parallel processing.
NT Servers: Suitable for small to medium-sized warehouses but with limited parallel
processing capabilities.
Data extraction should ideally happen on the original source platform, with subsequent
tasks like reformatting, merging, and preliminary cleansing done there.
Major transformations, consolidation, and validation are best suited for a staging area
platform, which is where all data prepares for loading into the main data warehouse.
In One of the Legacy Platforms: If most of your legacy data sources are on the same
platform and if extra capacity is readily available, then consider keeping your data
staging area in that legacy platform. In this option, you will save time and effort in
moving the data across platforms to the staging area.
On the Data Storage Platform: This is the platform on which the data warehouse
DBMS runs and the database exists. When you keep your data staging area on this
platform, you will realize all the advantages of applying the load images to the
database. You may even be able to eliminate a few intermediary substeps and apply
data directly to the database from some of the consolidated files in the staging area.
On a Separate Optimal Platform: You may review your data source platforms,
examine the data warehouse storage platform, and then decide that none of these
platforms are really suitable for your staging area. It is likely that your environment
needs complex data transformations. It is possible that you need to work through your
data thoroughly to cleanse and prepare it for your data warehouse. In such
circumstances, you need a separate platform to stage your data before loading it to the
database.
Client/Server Architecture
Components:
o Desktop Clients: User-facing layer (e.g., browser for reports, dashboards).
o Application Servers: Middleware for connectivity, metadata, authentication,
OLAP, and queries.
Symmetric Multiprocessing (SMP):
o Features:
This is a shared-everything architecture, the simplest parallel processing
machine.
Each processor has full access to the shared memory through a common
bus.
Communication between processors occurs through common memory.
Disk controllers are accessible to all processors.
o Benefits:
This is a proven technology that has been used since the early 1970s. It
provides high concurrency.
You can run many concurrent queries.
It balances workload very well.
It gives scalable performance; simply add more processors to the system
bus.
Being a simple design, you can administer the server easily.
o Limitations:
Available memory may be limited.
Performance may be limited by bandwidth for processor-to-processor
communication, I/O, and bus communication.
Availability is limited; the system behaves like a single computer with
many processors.
Clusters:
o Nodes with independent memory, sharing a common disk.
o Example: A medium-sized retail chain ensuring system availability through
clustered servers.
o Features:
Each node consists of one or more processors and associated memory.
Memory is not shared among the nodes; it is shared only within each
node. Communication occurs over a high-speed bus.
Each node has access to the common set of disks.
This architecture is a cluster of nodes.
o Benefits:
This architecture provides high availability; all data is accessible even if
one node fails.
It preserves the concept of one database.
This option is good for incremental growth.
o Limitations:
Bandwidth of the bus could limit the scalability of the system.
This option comes with a high operating system overhead.
Each node has a data cache; the architecture needs to maintain cache
consistency for internode synchronization. A cache is a “work area”
holding currently used data; the main memory is like a big file cabinet
stretching across the entire room.
Massively Parallel Processing (MPP):
o Features:
This is a shared-nothing architecture.
This architecture is more concerned with disk access than memory
access.
It works well with an operating system that supports transparent disk
access.
If a database table is located on a particular disk, access to that disk
depends entirely on the processor that owns it.
Internode communication is by processor-to-processor connection.
o Benefits:
This architecture is highly scalable.
The option provides fast access between nodes.
Any failure is local to the failed node; this improves system availability.
Generally, the cost per node is low.
o Limitations:
The architecture requires rigid data partitioning.
Data access is restricted.
Workload balancing is limited.
Cache consistency must be maintained.
Nonuniform Memory Access (NUMA):
Features:
o This is the newest architecture; it was developed in the 1990s.
o The NUMA architecture is like a big SMP broken into smaller SMPs that are
easier to build.
o The hardware considers all memory units as one giant memory. The system
has a single real memory address space over the entire machine; memory
addresses begin with 1 on the first node and continue on the following nodes.
Each node contains a directory of memory addresses within that node.
o In this architecture, the amount of time needed to retrieve a memory value
varies because the first node may need the value that resides in the memory of
the third node. That is why this architecture is called nonuniform memory
access architecture.
Benefits:
o Provides maximum flexibility.
o Overcomes the memory limitations of SMP.
o Better scalability than SMP.
o If you need to partition your data warehouse database and run these using a
centralized approach, you may want to consider this architecture. You may
also place your OLAP data on the same server.
Limitations:
o Programming for the NUMA architecture is even more complex than for MPP.
o Software support for NUMA is fairly limited.
o The technology is still maturing.
Client Machines:
Casual Users: Machines supporting basic HTML reports (e.g., store managers viewing
daily sales).
Power Users: Machines supporting OLAP and dashboards (e.g., analysts creating
predictive models).
Stages:
o Initial: Data staging and storage on the same platform.
o Growing: Segregated staging and storage platforms.
o Matured: Specialized platforms for development, staging, and storage.
Example: A startup begins with a single server, but as it grows, it transitions to cloud-
based storage and processing for scalability.
3. Software Requirements
Modern RDBMS products assist in acquiring and integrating data from diverse sources:
Mass Data Loading: Vendors offer utilities that handle the bulk import of large
datasets, reducing manual effort and improving efficiency.
Data Retrieval from Heterogeneous Systems: Features like database links and
connectors enable seamless access to data stored in various systems, whether structured
or unstructured.
Data Transformation: Advanced tools embedded in the RDBMS support cleaning,
merging, and enriching data as part of the ETL process.
Indexing is crucial in data warehouses where queries often access large datasets:
Bit-Mapped Indexes:
o Ideal for columns with fewer distinct values, such as gender or region codes.
o Significantly accelerates queries involving conditions like grouping or filtering
on these fields.
o Example: A query filtering customers based on region will benefit from faster
retrieval using a bit-mapped index.
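The bit-mapped index idea can be sketched in a few lines: one bitmask per distinct column value, with bit i set when row i holds that value. This is a toy illustration of the principle, not a production index structure.

```python
# Sketch of a bit-mapped index over a low-cardinality column ("region").
# One bitmask per distinct value; bit i is set if row i has that value.
rows = ["South", "North", "South", "East", "North", "South"]

def build_bitmap_index(values):
    index = {}
    for i, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << i)
    return index

index = build_bitmap_index(rows)

def matching_rows(bitmap):
    return [i for i in range(len(rows)) if bitmap >> i & 1]

# Filtering on region = 'South' is a single mask lookup:
print(matching_rows(index["South"]))                   # [0, 2, 5]
# Combining conditions is cheap bitwise work, e.g. South OR North:
print(matching_rows(index["South"] | index["North"]))  # [0, 1, 2, 4, 5]
```

This shows why bitmap indexes suit columns like region or gender: with few distinct values, each mask is dense and AND/OR combinations of filter conditions reduce to single bitwise operations.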
4. Load Balancing
With heavy query traffic in data warehouses, load balancing ensures optimal resource
utilization by:
Parallel query execution: splitting a single query into smaller operations (e.g., reading,
joining, sorting) and executing them in parallel.
Example:
A single query analyzing total sales across all regions is processed by dividing tasks like data
reads and aggregations.
Data partitioning: data is partitioned across disks, and tasks like reading are parallelized.
Example:
Reading customer data from servers in different regions.
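The partitioned, parallel read-and-aggregate pattern above can be sketched with a thread pool. The region partitions and amounts are hypothetical; a real warehouse does this inside the DBMS, not in application code.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel query execution: sales data partitioned by region
# (hypothetical partitions), each partial sum computed concurrently,
# then combined -- mirroring how a warehouse splits one query into
# parallel read/aggregate tasks.
partitions = {
    "south": [120.0, 80.0, 30.0],
    "north": [150.0, 60.0],
    "east":  [90.0],
}

def scan_and_aggregate(part):
    """Stand-in for reading one disk partition and summing it."""
    return sum(partitions[part])

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(scan_and_aggregate, partitions))

total_sales = sum(partial_sums)
print(total_sales)  # 530.0
```

The final combine step (summing the partial sums) is tiny compared with the scans, which is why parallelizing the reads dominates the speed-up.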
5. Selection of DBMS
Application:
Amazon selects a DBMS with high scalability to handle millions of transactions daily and
with portability for deployment on multiple cloud platforms.
6. Collection of Tools
Application:
Amazon uses:
Principle:
Always design the architecture before selecting the tools. Architecture defines the
structure, functions, and services of the data warehouse components, ensuring they
meet business requirements. Tools should then be chosen to align with these defined
functions.
Why This is Important:
If tools are chosen first, they may not align with the actual needs of the architecture.
For example, if a report generation tool is selected prematurely, it might fail to
support complex queries or visualization needs of power users, leaving the data
warehouse ineffective.
Data warehousing tools are categorized based on the tasks they perform, such as data modeling,
extraction, transformation, loading, quality assurance, and analytics. Each tool type has distinct
features.
a. Data Modeling Tools
Purpose: Create and maintain data models for source systems, staging areas, and the
data warehouse.
Features:
o Forward engineering: Generate database schemas.
o Reverse engineering: Create models from existing databases.
o Support dimensional modeling like STAR schemas.
b. Data Extraction Tools
Purpose: Extract data from source systems for loading into the warehouse.
Extraction Methods:
o Bulk extraction: Full data refresh.
o Change-based replication: Incremental data loads.
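The two extraction methods can be contrasted in a short sketch. The record layout and dates are illustrative: a bulk extract takes everything for a full refresh, while a change-based extract takes only rows modified since the last successful load.

```python
from datetime import datetime

# Illustrative operational records with a last-modified timestamp.
orders = [
    {"id": 1, "modified": datetime(2024, 3, 1)},
    {"id": 2, "modified": datetime(2024, 3, 5)},
    {"id": 3, "modified": datetime(2024, 3, 9)},
]

def bulk_extract(rows):
    """Full refresh: every row is re-extracted."""
    return list(rows)

def incremental_extract(rows, last_load):
    """Change-based replication: only rows modified since the last load."""
    return [r for r in rows if r["modified"] > last_load]

print(len(bulk_extract(orders)))                               # 3
print(len(incremental_extract(orders, datetime(2024, 3, 4))))  # 2
```

Change-based extraction keeps nightly loads small once the initial full refresh is done, at the cost of needing reliable modification timestamps (or a change log) in the source system.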
c. Data Transformation Tools
Purpose: Convert extracted data into formats suitable for loading into the warehouse.
Features:
o Splitting fields, consolidating data, standardization, and deduplication.
o Applying default values as needed.
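The transformation operations listed above can be sketched on a single raw record. All field names and the country-code lookup are assumptions for illustration: split a combined name field, standardize a code, and apply a default for a missing value.

```python
# Sketch of typical transformation-tool operations (illustrative names).
COUNTRY_CODES = {"india": "IN", "united states": "US"}  # assumed lookup

def transform(raw):
    # Splitting fields: break a combined name into first/last.
    first, _, last = raw["customer_name"].partition(" ")
    return {
        "first_name": first,
        "last_name": last,
        # Standardization: map free-text country names to ISO-style codes.
        "country": COUNTRY_CODES.get(raw.get("country", "").lower(), "XX"),
        # Default value: fill a missing segment with a sentinel.
        "segment": raw.get("segment") or "UNKNOWN",
    }

rec = transform({"customer_name": "Asha Kumar", "country": "India"})
print(rec)
# {'first_name': 'Asha', 'last_name': 'Kumar', 'country': 'IN', 'segment': 'UNKNOWN'}
```

Commercial transformation tools express the same operations declaratively, but the underlying steps (split, standardize, default) are exactly these.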
g. Dashboards
h. Scorecards
j. Alert Systems
Definition:
A DW appliance is an integrated device combining hardware, software, and storage,
designed specifically for data warehousing tasks.
a. Evolution of DW Appliances
b. Benefits of DW Appliances
1. Cost Reduction:
o Lower initial setup and maintenance costs.
o Reduced support costs with single-vendor solutions.
2. Performance Improvement:
o Mixed workloads (queries, reports, analysis) run efficiently.
o Parallel processing reduces time for queries and reports.
3. High Availability:
o Built-in redundancies, such as backup networks and disk mirroring.
4. Reduced Administration:
o Automated processes minimize troubleshooting and maintenance.
5. Scalability:
o Modular designs allow easy addition of capacity and performance.
6. Reliability:
o Homogeneous systems eliminate integration issues.
7. Faster Implementation:
o Reduced need for testing and integration speeds up deployment.
1. Understanding Metadata
What is Metadata?
Definition: Metadata is "data about data." It provides information about the data stored in the
data warehouse, such as its source, structure, format, updates, and how it is processed.
Developers need metadata to map, extract, and transform data from source systems
into the data warehouse.
Metadata defines the structure of both the source data and the target warehouse.
Metadata helps maintain, enhance, and expand the data warehouse by tracking data
sources, update cycles, and structural changes.
It assists in monitoring query performance, handling storage upgrades, and disaster
recovery.
Metadata acts as a communication hub that connects various components of the data
warehouse, such as:
Metadata provides users with tools to browse, search, and query the warehouse
effectively.
It prevents misinterpretations by clarifying data definitions, relationships, and usage
rules.
Developers rely on metadata to build and maintain the warehouse. It provides details
about source-to-target mappings, transformation rules, and data refresh cycles.
The architectural framework provides a blueprint of how the data warehouse components
interact to support business requirements. It is a high-level design that outlines the overall
structure, guiding principles, and data flow within the warehouse.
Key Components
1. Source Systems:
These are operational systems like CRM, ERP, or transactional databases that feed
data into the warehouse.
Example: Sales data from a POS system or customer data from a CRM platform.
2. ETL Processes:
ETL (Extract, Transform, Load) tools extract data from source systems, transform it
into a usable format, and load it into the data warehouse.
Example: Extracting product sales from multiple stores, converting currencies into a
single standard, and loading them into the warehouse.
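The currency-conversion ETL example above can be sketched end to end. The exchange rates, store names, and record layouts are all illustrative assumptions, and the "warehouse" here is just an in-memory list.

```python
# Minimal ETL sketch (illustrative rates and layouts): extract store-level
# sales, transform amounts into a single currency, load into a table.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "INR": 0.012}  # assumed rates

def extract():
    """Stand-in for pulling (store, amount, currency) rows from sources."""
    return [("store_1", 100.0, "USD"), ("store_2", 200.0, "EUR"),
            ("store_3", 5000.0, "INR")]

def transform(rows):
    """Convert every amount to the single standard currency."""
    return [(store, round(amount * RATES_TO_USD[cur], 2))
            for store, amount, cur in rows]

warehouse = []  # stand-in for the target table

def load(rows):
    warehouse.extend(rows)

load(transform(extract()))
print(warehouse)  # [('store_1', 100.0), ('store_2', 220.0), ('store_3', 60.0)]
```

Each stage is a separate function on purpose: real ETL tools keep extract, transform, and load as independent, restartable steps.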
3. Data Warehouse Storage:
This is the central repository where cleaned, integrated, and consolidated data is
stored for analysis.
Example: A database that stores daily sales totals by product, store, and region.
4. Data Marts:
These are smaller, subject-specific databases designed for specific business areas like
sales, marketing, or finance.
Technical architecture focuses on the tools, technologies, and infrastructure that support
the implementation and operation of the data warehouse. It addresses how data is processed,
stored, and accessed at a technical level.
Key Layers
3. Storage Layer:
Includes the data warehouse database, often implemented using relational databases
(e.g., Oracle, Snowflake) or cloud platforms (e.g., AWS Redshift).
Example: A Snowflake cloud-based warehouse storing sales data partitioned by year.
4. Processing Layer:
Analytical processing is performed using tools like OLAP (Online Analytical
Processing).
Example: Precomputing aggregated sales data for faster query performance.
5. Access Layer:
Enables end-users to access data through BI tools, dashboards, and APIs.
Example: Analysts use Power BI to create revenue trend visualizations.
Key Features:
1. Data Acquisition
Processes:
This area involves acquiring data from various sources and preparing it for storage in the data
warehouse. Key processes include:
Metadata Types:
The metadata recorded here provides detailed information about the source systems and
transformation rules. Examples include:
Source system platforms and models: Logical and physical structures of data
sources.
Data extraction methods: Techniques used to extract data.
Transformation and cleansing rules: Definitions for modifying and cleaning data.
Summarization rules: Rules for creating aggregated data like totals or averages.
Source-to-target mappings: Relationships between source and target systems.
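One source-to-target mapping of the kind listed above can be written out as a concrete metadata record. All system, table, and column names are hypothetical; a real repository would hold an entry like this for every element moved into the warehouse.

```python
# Illustrative metadata record for one source-to-target mapping
# (all names are hypothetical).
mapping = {
    "source": {"system": "order_mgmt", "table": "ORDERS", "column": "ORD_AMT"},
    "target": {"table": "fact_sales", "column": "sales_amount"},
    "extraction": "change-based replication, nightly",
    "transformation": ["strip nulls", "convert currency to USD"],
    "summarization": "daily totals by store",
}

def lineage(m):
    """Trace a warehouse column back to its operational source."""
    s, t = m["source"], m["target"]
    return f'{t["table"]}.{t["column"]} <- {s["system"]}.{s["table"]}.{s["column"]}'

print(lineage(mapping))
# fact_sales.sales_amount <- order_mgmt.ORDERS.ORD_AMT
```

A query like `lineage` is the basis of the traceability the text mentions: given any warehouse column, the metadata answers where it came from and what was done to it on the way.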
2. Data Storage
Processes:
Once data is acquired, it is loaded into the data warehouse and managed for long-term use.
Processes include:
Metadata Types:
Metadata in this area relates to data loading and storage management, such as:
3. Information Delivery
Processes:
This area is focused on making data accessible to end-users through queries, reports, and
analytical tools. Processes include:
Metadata Types:
Metadata here supports reporting and analytics, including:
Purpose:
Business metadata is designed to help end-users (e.g., managers, analysts) understand and
utilize the data warehouse by expressing data in meaningful, non-technical terms.
Key Characteristics:
Who Benefits:
Managers, analysts, power users, casual users, and executives use business metadata
for efficient decision-making and analysis.
Technical Metadata
Purpose:
Technical metadata supports IT professionals (e.g., developers, administrators) in building,
maintaining, and administering the data warehouse.
Key Characteristics:
Data models of source systems and mappings (source → staging → data warehouse).
Data extraction, transformation, and cleansing rules.
Data loading schedules and controls.
Physical and logical database designs.
Summarizations, aggregations, and derivation rules.
Authority/access privileges.
Connectivity information (e.g., network/server details).
Purge and archival rules.
Who Benefits:
IT staff like data architects, ETL developers, database administrators, and system
administrators rely on technical metadata for effective implementation and
management.
1. Project Manager: Helps plan and monitor the project, understanding data
dependencies.
2. Data Warehouse Administrator: Ensures smooth operations by maintaining
metadata records.
3. Database Administrator: Manages and optimizes database performance using
metadata.
4. Metadata Manager: Oversees metadata collection and integration.
Metadata in a data warehouse is essential for understanding and managing the system. Here's
how it's provided:
1. Proprietary Formats: Each tool uses its own metadata format, making integration
difficult.
2. Lack of Standards: No universally accepted metadata standards exist.
3. Centralized vs. Decentralized: Debate over a single repository versus multiple
fragmented stores.
4. Version Control: Keeping metadata consistent over time is tedious.
5. Conflicting Definitions: Different source systems may use varying naming
conventions, definitions, and rules.
Metadata Requirements
Sources of Metadata
Metadata originates from multiple points in the data warehouse lifecycle, such as:
1. Source Systems: Data models, file layouts, and documentation from operational
systems.
2. Data Extraction: Details on selected fields, extraction schedules, and methods.
3. Data Transformation: Rules for converting, validating, and auditing data.
4. Data Loading: Specifications for loading data into the warehouse, including audit
trails and schedules.
5. Data Storage: Metadata on the structure and grouping of tables in the warehouse.
6. Information Delivery: Metadata for reporting tools, predefined queries, and OLAP
data models.
Metadata Repository
Types of Metadata
1. Business Metadata:
o Focused on end-users, helping them understand the data in terms of business
context.
o Examples: Definitions of warehouse tables and columns in business language,
predefined reports, data load schedules, and user access permissions.
2. Technical Metadata:
o Needed by developers and administrators to manage, maintain, and optimize
the data warehouse.
o Examples: Source system models, transformation rules, data cleansing and
loading procedures, backup/recovery configurations.
Information Navigator
An information navigator bridges the gap between business and technical metadata. It helps
users locate and interact with metadata effectively. Some key functions include:
1. Interface with Query Tools: Allows users to access metadata definitions via third-
party query tools.
2. Drill Down for Details: Enables navigating from high-level summaries (e.g., table
definitions) to detailed metadata (e.g., attribute specifications).
3. Review Queries and Reports: Users can browse predefined queries and reports,
selecting and launching them with desired parameters.
Standardization is essential for seamless metadata exchange across tools and processes in a
data warehouse. Two widely accepted standards include:
Implementation Options
1. Goals: What is the purpose of metadata in your organization? (E.g., ease of access,
data lineage tracking).
2. Sources: Where will metadata come from? (E.g., source systems, staging areas).
3. Maintenance: Who will update and manage metadata?
4. Usage: How will metadata be used, and by whom? (E.g., business users for reports,
developers for optimization).
5. Tools: What tools are needed for effective metadata management?
Modern data warehouses increasingly rely on web-based tools for metadata navigation. Using
intranets, business users can: