
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
JNANA SANGAMA, BELGAVI-590018, KARNATAKA

DATA WAREHOUSING
(As per CBCS Scheme 2022)

Sub Code: BAD515B

PREPARED BY:
INDHUMATHI R (ASST.PROF DEPT OF DS (CSE), KNSIT)

DEPARTMENT OF COMPUTER SCIENCE (DATA SCIENCE) AND ENGINEERING

K.N.S INSTITUTE OF TECHNOLOGY


HEGDE-NAGAR, KOGILU ROAD,
THIRUMENAHALLI, YELAHANKA,
BANGALORE-560064

MODULE 3
Chapter 7: Architectural Components

1. Understanding Data Warehouse Architecture


Architecture in Three Major Areas: Data warehouse architecture is described in terms of data
acquisition, data storage, and information delivery.

The architectural components of a data warehouse fall into three main areas:

1. Data Acquisition
2. Data Storage
3. Information Delivery

1. Data Acquisition

 Source Data: This is the raw data coming from different sources. It could be from:

o External sources (like market trends, competitor analysis, or external databases)
o Production sources (real-time operational systems used in daily business
processes)
o Archived/internal data (older or historical data from within the organization)
 Data Staging: This is the area where the source data is collected, cleaned, transformed,
and prepared for loading into the data warehouse. Data staging handles tasks like data
extraction, transformation, and loading (ETL). Think of this as the “preparation area”
before data is stored in the warehouse.
 Purpose: This component ensures that all data, regardless of the source, is standardized
and cleansed so it can be used effectively in the data warehouse.
 Real-Time Application: In an airline company, data from booking systems, customer
feedback, and external weather data are collected and prepared in the data staging area
before being stored.
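
As an illustration of the staging flow described above, the sketch below walks raw records from two hypothetical sources through extraction, cleansing, and loading into a staging list. The source names, field names, and cleansing rules are assumptions made for illustration, not part of any specific product.

```python
# Minimal ETL staging sketch; source systems and field names are hypothetical.

def extract(sources):
    """Pull raw records from each source and tag them with their origin."""
    for name, records in sources.items():
        for rec in records:
            yield {**rec, "source": name}

def transform(record):
    """Standardize and cleanse one record before it reaches the warehouse."""
    return {
        "customer": record["customer"].strip().title(),  # fix casing and whitespace
        "amount": round(float(record["amount"]), 2),     # standardize the numeric format
        "source": record["source"],
    }

def load(staging_area, records):
    """Collect cleansed records in the staging area, ready for warehouse loading."""
    staging_area.extend(records)

sources = {
    "booking_system":  [{"customer": "  alice  ", "amount": "120.50"}],
    "feedback_portal": [{"customer": "BOB", "amount": "0"}],
}

staging_area = []
load(staging_area, (transform(r) for r in extract(sources)))
print(staging_area)
```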

2. Data Storage

 Data Warehouse DBMS (Database Management System): This is the central
repository that stores the processed and cleansed data in a structured format. It allows
for long-term data storage and is optimized for retrieval and analysis rather than
frequent updates.
 Metadata: Metadata is data about data. It describes the structure, origin, and usage of
the data in the warehouse. Metadata helps users understand what data is available,
where it came from, and how it’s organized.
 Data Marts: These are smaller, specialized subsets of the data warehouse. A data mart
focuses on a specific business area or department, like sales or finance, providing quick
access to data for those specific needs.
 Purpose: This area is crucial for storing and organizing the data for long-term analysis.
Data marts also improve performance by providing focused datasets for specific user
groups.

3. Information Delivery

 Report/Query: This component enables users to create reports and run queries on the
data warehouse. It allows them to retrieve specific information based on their
requirements.
 OLAP (Online Analytical Processing): OLAP tools allow users to perform complex,
multi-dimensional analysis. With OLAP, users can perform actions like drilling down
into detailed data or rolling up for aggregated insights.
 Data Mining: This involves identifying patterns and correlations in the data, often
using algorithms to predict future trends or behaviors. Data mining helps organizations
uncover insights that are not immediately visible.
 Purpose: Information delivery ensures that the data stored in the warehouse is
accessible and useful to end-users, allowing for strategic decision-making.
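
To make the roll-up and drill-down operations mentioned under OLAP above concrete, here is a minimal sketch over an in-memory list of facts; the regions, products, and amounts are hypothetical values chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical sales facts at the most detailed grain: (region, product, amount).
facts = [
    ("Europe", "Laptop", 1200), ("Europe", "Phone", 800),
    ("Asia",   "Laptop", 900),  ("Asia",   "Phone", 1100),
]

def roll_up(facts):
    """Aggregate detailed facts up to the region level."""
    totals = defaultdict(int)
    for region, _product, amount in facts:
        totals[region] += amount
    return dict(totals)

def drill_down(facts, region):
    """Return the detailed rows behind one region's total."""
    return [f for f in facts if f[0] == region]

print(roll_up(facts))               # {'Europe': 2000, 'Asia': 2000}
print(drill_down(facts, "Europe"))  # the detail behind Europe's total
```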

Management & Control

 This component oversees the data warehouse's operations. It ensures that the data flow
between components is efficient, monitors data quality, and manages access to various
parts of the system.

2. Distinguishing Characteristics

Objectives and Scope of Data Warehouse

 Data warehouses are built to support strategic decision-making by providing a broad
view of business data. Unlike operational systems that support daily activities, data
warehouses are designed for analyzing historical data over time.

Data Content

 Data in a warehouse is mostly read-only, meaning it’s not regularly modified but is
stored for analysis. It’s integrated from multiple sources and represents historical data
rather than real-time, current data. This enables a complete view of long-term trends.

Complex Analysis and Quick Response

 Data warehouses support complex, interactive analysis. Users can perform tasks like
“drilling down” to view detailed data, “rolling up” to see aggregated data, and “slicing
and dicing” to look at data from different perspectives. This functionality is critical for
making quick, strategic decisions.

Support for High Data Volumes

 Since data warehouses store years of historical data, they need to handle very high
volumes efficiently. This is particularly important in large organizations where the data
generated is vast.

Flexible and Dynamic

 Data warehouses need to be flexible and adaptable. As business needs evolve, new
requirements may emerge, so the architecture must allow for easy updates and
adjustments.

Metadata-Driven

 Metadata plays a crucial role in managing and understanding the data warehouse.

3. Architectural Framework

1. Architecture Supporting Flow of Data:


At the Data Source: Here the internal and external data sources form the source data
architectural component. Source data governs the extraction of data for preparation and
storage in the data warehouse. The data staging architectural component governs the
transformation, cleansing, and integration of data.
In the Data Warehouse Repository:
The data storage architectural component includes the loading of data from the staging
area and also storing the data in suitable formats for information delivery. The metadata
architectural component is also a storage mechanism to contain data about the data at
every point of the flow of data from beginning to end.

At the User End:
The information delivery architectural component includes dependent data marts,
special multidimensional databases, and a full range of query and reporting facilities,
including dashboards and scorecards.

2. Management and Control Module:


Throughout this process, the management and control module monitors the data flow,
backs up the warehouse, and ensures that only authorized employees can access
sensitive data. If there is a failure in the data pipeline, this module detects and rectifies
the issue, ensuring the data remains available and accurate.

4. Technical Architecture
The ShopifyPlus scenario below also illustrates the technical architecture by showing the set
of functions (e.g., data extraction, transformation, storage, and retrieval) and services
provided within each architectural component.
Although specific tools weren’t named, the scenario emphasizes that the architecture
is designed first, then tools are selected based on the complexity and scope of the
platform's needs (e.g., whether sophisticated extraction tools are necessary based on the
data sources' variety).

Scenario: E-Commerce Platform - "ShopifyPlus"

Context:
ShopifyPlus is an online retail platform that sells a wide variety of products, including
electronics, apparel, and home goods. It operates in multiple countries and has millions of
users visiting its site daily. ShopifyPlus wants to leverage a data warehouse to provide
insights for decision-making, customer behavior analysis, inventory management, sales
forecasting, and marketing campaigns.

1. Data Acquisition

Overview:
Data acquisition involves extracting data from various sources, processing it, and loading it
into the data warehouse for analysis and reporting. For ShopifyPlus, data is extracted from
internal and external systems and stored in a staging area before being transformed.

Key Components:

 Data Sources:
ShopifyPlus gathers data from:
o Operational systems, such as transactional databases (e.g., order management
system, inventory management).
o Legacy systems, which might contain archived transaction history.
o ERP (Enterprise Resource Planning) system data, which consolidates sales and
supply chain data.
o External sources, like market trend data, social media analytics, and customer
feedback.
 Intermediary Data Stores:
During extraction, ShopifyPlus pulls data into temporary files for pre-processing. For
instance, they might merge data from various regional warehouses or split files by
product category before moving them into the staging area.
 Staging Area:
The staging area serves as the main preparation ground where all extracted data is
cleaned, transformed, and merged. For ShopifyPlus, this area processes:
o Product and inventory data: Cleansed for duplicates and inconsistencies.
o Customer transaction data: Merged across multiple regions to ensure unique
customer profiles.
o Sales data: Aggregated by time and region for trend analysis.

Functions and Services:

 Data Extraction:
ShopifyPlus applies filters to select relevant data from operational systems, like
extracting only high-value customer transactions for specific analysis. They create
intermediary files for merging similar data before moving to staging.
 Data Transformation:
o Cleaning and Deduplication: Removing duplicate customer records.

o Mapping: Mapping sales and customer data from multiple sources to a common
format for the data warehouse.
o Aggregation: Summing up daily transactions for each product.
o Resolving Inconsistencies: Standardizing formats for currency and date/time.
 Data Staging:
Once cleaned, the data in the staging area is backed up, sorted, and indexed for quick
loading into the data warehouse repository.
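
The transformation rules listed above (deduplication, mapping to a common format, and aggregation) could be sketched as below. The record layout, the exchange rates, and the date formats are assumptions for illustration, not actual ShopifyPlus rules.

```python
from collections import defaultdict

# Hypothetical raw transaction records pulled from two regional systems.
raw = [
    {"customer_id": "C1", "amount": 100.0, "currency": "USD", "date": "2024-01-05"},
    {"customer_id": "C1", "amount": 100.0, "currency": "USD", "date": "2024-01-05"},  # duplicate
    {"customer_id": "C2", "amount": 90.0,  "currency": "EUR", "date": "05/01/2024"},
]

RATES = {"USD": 1.0, "EUR": 1.1}  # assumed conversion rates to a common currency

def deduplicate(records):
    """Drop exact duplicate transactions."""
    seen, unique = set(), []
    for r in records:
        key = (r["customer_id"], r["amount"], r["currency"], r["date"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def standardize(record):
    """Map each record to a common currency and ISO date format."""
    date = record["date"]
    if "/" in date:                      # e.g. "05/01/2024" -> "2024-01-05"
        day, month, year = date.split("/")
        date = f"{year}-{month}-{day}"
    return {
        "customer_id": record["customer_id"],
        "amount_usd": round(record["amount"] * RATES[record["currency"]], 2),
        "date": date,
    }

def aggregate_daily(records):
    """Sum standardized amounts per day, as a simple summarization rule."""
    totals = defaultdict(float)
    for r in records:
        totals[r["date"]] += r["amount_usd"]
    return dict(totals)

clean = [standardize(r) for r in deduplicate(raw)]
print(aggregate_daily(clean))
```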

2. Data Storage

Overview:
Data storage involves loading the transformed data into a central repository, typically an
RDBMS (Relational Database Management System), where data can be stored in a structured
and organized way for analysis.

Key Components:

 Data Repository:
ShopifyPlus’s data warehouse is organized with relational databases where data is
stored in a star schema with fact and dimension tables.
o Fact Tables: Transaction details (e.g., sales amount, quantity sold).
o Dimension Tables: Customer profiles, product categories, geographic
regions, and time.
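
A minimal sketch of the star schema described above, using SQLite purely because it ships with Python; the table and column names are assumed for illustration and do not come from an actual ShopifyPlus design.

```python
import sqlite3

# Illustrative star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    region       TEXT
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    category     TEXT
);

CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sale_date    TEXT,
    quantity     INTEGER,
    amount       REAL
);
""")

# A typical star-schema query: total sales amount by product category and region.
query = """
SELECT p.category, c.region, SUM(f.amount)
FROM fact_sales f
JOIN dim_product  p ON p.product_key  = f.product_key
JOIN dim_customer c ON c.customer_key = f.customer_key
GROUP BY p.category, c.region;
"""
print(conn.execute(query).fetchall())   # empty until the fact table is loaded
```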

Functions and Services:

 Loading Data:
The initial data load involves a full refresh to populate tables, while incremental
updates occur daily to add new transactions and update existing records.
 Data Granularity:
ShopifyPlus keeps detailed data at the transaction level for analysis. However, they
also store aggregated data for faster access when running high-level reports.

 Backup and Recovery:
ShopifyPlus has a backup system in place to prevent data loss. Data is backed up
daily, and redundant systems ensure availability even if there is a hardware failure.
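
The full refresh followed by daily incremental updates can be pictured with a simple high-water-mark pattern, as in the sketch below; the record shape and timestamps are assumptions made for illustration.

```python
# Incremental loading via a "last loaded timestamp" high-water mark (illustrative).

warehouse = []                          # stands in for the warehouse fact table
last_loaded = "1970-01-01T00:00:00"     # before the first full refresh

def full_refresh(source_rows):
    """Initial load: replace warehouse contents with everything in the source."""
    global last_loaded
    warehouse.clear()
    warehouse.extend(source_rows)
    if source_rows:
        last_loaded = max(r["updated_at"] for r in source_rows)

def incremental_load(source_rows):
    """Daily load: apply only rows changed since the last load."""
    global last_loaded
    new_rows = [r for r in source_rows if r["updated_at"] > last_loaded]
    warehouse.extend(new_rows)
    if new_rows:
        last_loaded = max(r["updated_at"] for r in new_rows)

day1 = [{"order_id": 1, "updated_at": "2024-01-01T10:00:00"}]
day2 = day1 + [{"order_id": 2, "updated_at": "2024-01-02T09:30:00"}]

full_refresh(day1)
incremental_load(day2)    # only order 2 is added the second time
print(warehouse)
```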

3. Information Delivery

Overview:
The information delivery area involves making data accessible to end-users through reports,
dashboards, and OLAP (Online Analytical Processing) cubes.

Key Components:

 Data Flow:
Data flows from the central warehouse to different data marts (e.g., for marketing,
finance, and sales teams) and directly to dashboards. For ShopifyPlus, key insights are
presented in real-time dashboards, ad hoc reports, and via OLAP cubes.
 Service Locations:
Query services are available on user desktops, and users can directly access
dashboards for sales and inventory data. ShopifyPlus uses an application server to
generate reports automatically at regular intervals.
 Data Stores for Information Delivery:

o Temporary Data Stores: Used to save query results that are frequently
accessed, such as customer segments based on purchase history.
o OLAP Cubes: OLAP cubes are built for multidimensional analysis, providing
quick access to sales by product category, region, and time.
o Dashboards and Scorecards: Real-time dashboards display current
inventory, recent sales, and revenue trends.

Functions and Services:

 Security and Access Control:
User roles determine access to data. For instance, marketing personnel can access
customer demographic data, while finance staff can access revenue and sales data.
 Query Optimization:
ShopifyPlus’s data warehouse is optimized to recognize and leverage aggregate
tables, allowing for faster query responses.
 Self-Service Report Generation:
Business users can generate ad hoc reports to explore specific data insights, such as
the impact of a marketing campaign on sales.
 OLAP for Complex Analysis:
ShopifyPlus enables users to perform complex analysis, such as analyzing sales
performance across different regions and seasons.
 Dashboards:
Real-time data flows into dashboards, giving decision-makers immediate visibility
into current trends like best-selling products or low-stock items.
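
The role-based access rules mentioned above (marketing sees customer demographics, finance sees revenue and sales) could be enforced along the lines of this sketch; the role names and dataset names are assumptions for illustration.

```python
# Illustrative role-based access control for information delivery.

ROLE_PERMISSIONS = {
    "marketing": {"customer_demographics", "campaign_results"},
    "finance":   {"revenue", "sales"},
    "admin":     {"customer_demographics", "campaign_results", "revenue", "sales"},
}

def can_access(role, dataset):
    """Return True if the given role is allowed to query the dataset."""
    return dataset in ROLE_PERMISSIONS.get(role, set())

def run_query(role, dataset):
    if not can_access(role, dataset):
        raise PermissionError(f"role '{role}' may not query '{dataset}'")
    return f"results of {dataset}"   # placeholder for the real query engine

print(can_access("marketing", "revenue"))   # False
print(run_query("finance", "revenue"))      # allowed
```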

4. Management & Control

Overview:
This area includes the management and administration of the data warehouse, ensuring
efficient operations, data integrity, and consistent performance.

Key Components:

 Metadata Management:
Metadata catalogs describe data sources, transformations, and structures in the data
warehouse, enabling traceability and governance.
 Monitoring and Optimization:
ShopifyPlus monitors usage to identify and improve bottlenecks in the system. For
example, they track query execution times and usage patterns to refine indexes and
tune query performance.
 Backup and Security:
Regular backups and data encryption ensure that ShopifyPlus's data is protected.
Security measures prevent unauthorized access, while audit trails record who accessed
what data.

Functions and Services:

 Query Governance:
Query limits prevent system overload by restricting large or complex queries.
 Event Triggers:
Triggers alert administrators when data loads are complete, allowing teams to validate
data before it's available to users.
 Metadata Management:
ShopifyPlus stores metadata for every data element in the warehouse, including
source, transformation rules, and access permissions.
 System Monitoring and Fine-Tuning:
Continuous monitoring helps in improving performance. ShopifyPlus periodically
archives older data to maintain system efficiency and prevent data bloat.
 Audit Trails and Security Logs:
Audit trails are maintained for compliance and to track data lineage, ensuring that data
accuracy can be verified if needed.
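
A small sketch combining the query-governance and audit-trail ideas above: every query attempt is logged, and queries whose estimated result size exceeds a limit are rejected. The threshold and log fields are assumptions for illustration.

```python
import time

AUDIT_LOG = []
MAX_ESTIMATED_ROWS = 1_000_000   # assumed governance threshold

def governed_query(user, sql, estimated_rows):
    """Reject oversized queries and record every attempt in an audit trail."""
    allowed = estimated_rows <= MAX_ESTIMATED_ROWS
    AUDIT_LOG.append({
        "user": user,
        "sql": sql,
        "estimated_rows": estimated_rows,
        "allowed": allowed,
        "timestamp": time.time(),
    })
    if not allowed:
        raise RuntimeError("query rejected by the query governor")
    return f"executing: {sql}"   # placeholder for the real execution path

print(governed_query("analyst1", "SELECT * FROM fact_sales WHERE region = 'EU'", 50_000))
print(len(AUDIT_LOG))            # both accepted and rejected attempts are logged
```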

5. Architectural Types

1. Centralized Corporate Data Warehouse

 Structure: All data is stored in a single, centralized data warehouse.


 Data Flow: Data is pulled from source systems, processed in a staging area, and then
moved to the centralized data warehouse, which is the sole source for information
delivery.
 End-User Access: Users access business intelligence directly from the centralized data
warehouse.
 Characteristics: There are no data marts (smaller, specialized data storage areas) in
this architecture, so all data resides in the main data warehouse, supporting a single
point of access for reporting and analysis.

2. Independent Data Marts

 Structure: Multiple, isolated data marts serve individual departments or purposes, with
no integration or connection between them.
 Data Flow: Each data mart is built separately and often uses data relevant only to its
specific department. There’s no staging area that combines or integrates these data
marts.
 End-User Access: Users in each department access business intelligence from their
respective data marts only.

 Characteristics: This architecture can emerge without an overall plan, leading to data
silos, where data is stored independently within each data mart, without a unified view
of enterprise data.

3. Federated Architecture

 Structure: A collection of data marts and/or data warehouses is loosely connected
through common data elements.
 Data Flow: Data from various sources is shared across the federation of data marts and
warehouses, using logical or physical integration to create a unified view.
 End-User Access: Users access business intelligence from the integrated data across
multiple sources.
 Characteristics: Federated architecture aims to achieve a “single version of truth” by
aligning common data across multiple data marts and warehouses, without creating a
centralized warehouse.

4. Hub-and-Spoke Architecture

 Structure: A centralized data warehouse acts as the main data source, with dependent
data marts that draw from this warehouse.
 Data Flow: Data flows from source systems to a staging area, then to the central data
warehouse. From there, it flows to dependent data marts, which serve specific user
needs.
 End-User Access: Users can access information either from the centralized data
warehouse or from the dependent data marts.

 Characteristics: This architecture allows data marts to specialize in specific areas
while maintaining centralized data management, which can help avoid data
redundancy.

5. Data-Mart Bus Architecture

 Scenario: ShopVerse operates without a distinct, single data warehouse. Instead, it has
a series of conformed data marts, each designed to serve specific business functions,
like Sales, Marketing, and Inventory.

 Data Flow: Data flows from source systems into a staging area, where it is standardized
and then distributed to the conformed data marts. Common business dimensions, like
“Customer” and “Product,” are shared across these data marts.
 End-User Access: End-users in various departments can access reports and analysis
based on data in the conformed marts, which are linked to provide a comprehensive
view.
 Characteristics: The collection of conformed data marts effectively serves as a unified
data warehouse, allowing departments to access interconnected data marts for
enterprise-wide insights.

CHAPTER 8: INFRASTRUCTURE AS THE FOUNDATION FOR DATA WAREHOUSING

1. Importance of Infrastructure

Distinction between Architecture and Infrastructure:

 Architecture is the design or blueprint of the data warehouse, outlining the
structure, components, and data flow. It includes elements like data staging, data
storage, and information delivery.
 Infrastructure, on the other hand, is the foundational setup that supports this
architecture. It includes all the physical and operational resources required to
implement and maintain the architecture, such as hardware, software, networks,
and procedures.

Supporting Infrastructure:

 Infrastructure supports architecture by providing the essential resources and tools
that allow each architectural component (like data staging) to perform its functions.
For instance, the infrastructure provides the storage and computing power to handle
data transformation and cleansing in the staging area.

Components of Data Warehouse Infrastructure:

 Physical Infrastructure: Includes the core hardware (servers, storage devices),
operating systems, network, and DBMS (database management system). It also
involves network software and vendor tools that help manage each part of the data
warehouse.
 Operational Infrastructure: Refers to the non-physical elements essential for
maintaining the data warehouse, such as:
o People: Personnel who manage and operate the data warehouse.
o Procedures: Business rules and operational procedures.
o Training: Training programs to keep the team updated on tools and processes.
o Management Software: Software for monitoring, managing, and optimizing
data warehouse operations.

Importance of Both Physical and Operational Infrastructure:

 While the physical infrastructure provides the technical foundation, the operational
infrastructure ensures smooth functioning. Both are crucial for data warehouse
efficiency; without the right operational support, even a well-designed physical setup
may not perform optimally.

2. Hardware Requirements

Infrastructure Considerations:

 When selecting hardware for a data warehouse, it's essential to evaluate how much
existing infrastructure can be leveraged and keep a modular approach for easy upgrades.
 Assess the capacity of the current systems to determine how much storage or processing
power is available or if new components are needed.

Hardware Selection Guidelines:

 Scalability: Ensure the hardware can grow to support increased data and user demands.
 Vendor Support and Stability: Opt for reliable hardware vendors that can offer robust
support.
 Platform Compatibility: Choose hardware compatible with the operating system and
database software used in the warehouse.

Operating System Criteria:

 Scalability: The OS must support growth in user base and query complexity.
 Security, Reliability, and Availability: The OS must provide a secure, stable, and
resilient environment.
 Preemptive Multitasking & Multithreading: For effective multitasking and
distribution of workloads across processors, crucial for handling multiple simultaneous
requests.

Platform Options:

 Mainframes: Typically outdated and not cost-effective for data warehousing, but may
be repurposed for small data marts if there’s spare capacity.
 Open System Servers (like UNIX): Common in data warehousing for their robustness
and support for parallel processing.
 NT Servers: Suitable for small to medium-sized warehouses but with limited parallel
processing capabilities.

Platform Setup Options:

 Single-Platform Option: A straightforward setup where all data warehouse functions
occur on one platform, generally ideal when legacy systems can handle the load.
However, many companies avoid it due to limited scalability and compatibility with
modern tools.
 Hybrid Approach: Common in most companies, where different tasks are distributed
across various platforms. For example, data extraction is often performed on the source
system’s platform, while staging and transformation occur on a central platform.

Data Acquisition and Staging Area:

 Data extraction should ideally happen on the original source platform, with subsequent
tasks like reformatting, merging, and preliminary cleansing done there.
 Major transformations, consolidation, and validation are best suited for a staging area
platform, which is where all data prepares for loading into the main data warehouse.
 In One of the Legacy Platforms: If most of your legacy data sources are on the same
platform and if extra capacity is readily available, then consider keeping your data
staging area in that legacy platform. In this option, you will save time and effort in
moving the data across platforms to the staging area.
 On the Data Storage Platform: This is the platform on which the data warehouse
DBMS runs and the database exists. When you keep your data staging area on this
platform, you will realize all the advantages for applying the load images to the
database. You may even be able to eliminate a few intermediary substeps and apply
data directly to the database from some of the consolidated files in the staging area.
 On a Separate Optimal Platform: You may review your data source platforms,
examine the data warehouse storage platform, and then decide that none of these
platforms are really suitable for your staging area. It is likely that your environment
needs complex data transformations. It is possible that you need to work through your
data thoroughly to cleanse and prepare it for your data warehouse. In such
circumstances, you need a separate platform to stage your data before loading to the
database.

Data Movement Options

 Shared Disk: Common disk area accessible by multiple platforms.


o Example: Two departments in a company share sales and inventory data using
a shared network drive.
 Mass Data Transmission: Massive data transfer between platforms using data ports.
o Example: Exporting gigabytes of transactional data from an on-premises
database to the cloud over high-speed network ports.
 Real-Time Connection: Platforms directly interact in real-time to exchange or process
data.
o Example: A banking app connects to a real-time fraud detection system hosted
on a different server.
 Manual Methods: Data is transferred using physical media like tapes or disks.
o Example: Transferring archival data from old systems using external hard
drives.

Client/Server Architecture

 Components:
o Desktop Clients: User-facing layer (e.g., browser for reports, dashboards).
o Application Servers: Middleware for connectivity, metadata, authentication,
OLAP, and queries.

o Database Server: Hosts the data warehouse database.


 Example:
o Client: Analyst accessing reports via a web app.
o Application Server: Runs middleware for generating dynamic reports.
o Database Server: Stores cleaned sales, customer, and inventory data.

Server Hardware Architectures

 Symmetric Multiprocessing (SMP):


o Shared-everything architecture with a single memory bus.
o Example: A small company’s 200 GB warehouse using multiple processors
sharing the same memory.

o Features:
 This is a shared-everything architecture, the simplest parallel processing
machine.
 Each processor has full access to the shared memory through a common
bus.
 Communication between processors occurs through common memory.
Disk controllers are accessible to all processors.
o Benefits:
 This is a proven technology that has been used since the early 1970s. It
provides high concurrency.
 You can run many concurrent queries.
 It balances workload very well.
 It gives scalable performance; simply add more processors to the system
bus.
 Being a simple design, you can administer the server easily.
o Limitations:
 Available memory may be limited.
 Performance may be limited by bandwidth for processor-to-processor
communication, I/O, and bus communication.
 Availability is limited; the system is like a single computer with many processors.
 Clusters:
o Nodes with independent memory, sharing a common disk.
o Example: A medium-sized retail chain ensuring system availability through
clustered servers.

o Features:
 Each node consists of one or more processors and associated memory.
 Memory is not shared among the nodes; it is shared only within each
node. Communication occurs over a high-speed bus.
 Each node has access to the common set of disks.
 This architecture is a cluster of nodes.
o Benefits:
 This architecture provides high availability; all data is accessible even if
one node fails.
 It preserves the concept of one database.
 This option is good for incremental growth.
o Limitations:
 Bandwidth of the bus could limit the scalability of the system.
 This option comes with a high operating system overhead.
 Each node has a data cache; the architecture needs to maintain cache
consistency for internode synchronization. A cache is a “work area”
holding currently used data; the main memory is like a big file cabinet
stretching across the entire room.

 Massively Parallel Processing (MPP):


o Shared-nothing architecture with nodes processing their data independently.
o Example: A social media platform analyzing user interactions across terabytes
of data.

o Features:
 This is a shared-nothing architecture.
 This architecture is more concerned with disk access than memory
access.
 It works well with an operating system that supports transparent disk
access.
 If a database table is located on a particular disk, access to that disk
depends entirely on the processor that owns it.
 Internode communication is by processor-to-processor connection.
o Benefits:
 This architecture is highly scalable.
 The option provides fast access between nodes.
 Any failure is local to the failed node; this improves system availability.
 Generally, the cost per node is low.
o Limitations:
 The architecture requires rigid data partitioning.
 Data access is restricted.
 Workload balancing is limited.
 Cache consistency must be maintained.
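
The rigid data partitioning that a shared-nothing MPP system relies on can be pictured as simple hash partitioning, as in the sketch below; the node count and partitioning key are assumptions for illustration.

```python
# Illustrative shared-nothing partitioning: each node owns a disjoint slice of rows.

NUM_NODES = 4

def node_for(key):
    """Route a row to exactly one node based on a hash of its partitioning key.
    (Python's hash is consistent within a single run, which is all we need here.)"""
    return hash(key) % NUM_NODES

nodes = {n: [] for n in range(NUM_NODES)}
rows = [{"customer_id": f"C{i}", "amount": i * 10} for i in range(12)]

for row in rows:
    nodes[node_for(row["customer_id"])].append(row)

# Each node aggregates only its own data; a coordinator combines the partial results.
partials = [sum(r["amount"] for r in nodes[n]) for n in range(NUM_NODES)]
print(sum(partials))   # same answer as a single-node scan, computed in disjoint slices
```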

 NUMA (Nonuniform Memory Access):


o Combines SMP with improved scalability.
o Example: A global financial firm using NUMA for complex OLAP queries
and partitioned data warehouses.

 Features:
o This is the newest architecture; it was developed in the 1990s.
o The NUMA architecture is like a big SMP broken into smaller SMPs that are
easier to build.
o The hardware considers all memory units as one giant memory. The system
has a single real memory address space over the entire machine; memory
addresses begin with 1 on the first node and continue on the following nodes.
Each node contains a directory of memory addresses within that node.
o In this architecture, the amount of time needed to retrieve a memory value
varies because the first node may need the value that resides in the memory of
the third node. That is why this architecture is called nonuniform memory
access architecture.
 Benefits:
o Provides maximum flexibility.
o Overcomes the memory limitations of SMP.
o Better scalability than SMP.
o If you need to partition your data warehouse database and run these using a
centralized approach, you may want to consider this architecture. You may
also place your OLAP data on the same server.

 Limitations:
o Programming for the NUMA architecture is even more complex than for MPP.
o Software support for NUMA is fairly limited.
o The technology is still maturing.

Client Workstation Considerations

 Casual Users: Machines supporting basic HTML reports (e.g., store managers viewing
daily sales).
 Power Users: Machines supporting OLAP and dashboards (e.g., analysts creating
predictive models).

Evolution of Data Warehouse Platforms

 Stages:
o Initial: Data staging and storage on the same platform.
o Growing: Segregated staging and storage platforms.
o Matured: Specialized platforms for development, staging, and storage.
 Example: A startup begins with a single server, but as it grows, it transitions to cloud-
based storage and processing for scalability.

3. Software Requirements

Features of Leading Commercial RDBMS for Data Warehousing

Commercial Relational Database Management Systems (RDBMS), originally designed for
operational systems like OLTP, have evolved to include features for decision support systems
such as data warehouses. Key enhancements include:

1. Data Acquisition Support

Modern RDBMS products assist in acquiring and integrating data from diverse sources:

 Mass Data Loading: Vendors offer utilities that handle the bulk import of large
datasets, reducing manual effort and improving efficiency.
 Data Retrieval from Heterogeneous Systems: Features like database links and
connectors enable seamless access to data stored in various systems, whether structured
or unstructured.
 Data Transformation: Advanced tools embedded in the RDBMS support cleaning,
merging, and enriching data as part of the ETL process.

2. Replication and Incremental Loading

To address the need for consistent updates in data warehouses:

 Replication Features: Enable continuous or scheduled data synchronization across
multiple systems.
 Incremental Loading: Allows only new or updated records to be processed, reducing
the overall ETL execution time.

3. Enhanced Indexing for Query Optimization

Indexing is crucial in data warehouses where queries often access large datasets:

 Bit-Mapped Indexes:
o Ideal for columns with fewer distinct values, such as gender or region codes.
o Significantly accelerates queries involving conditions like grouping or filtering
on these fields.

o Example: A query filtering customers based on region will benefit from faster
retrieval using a bit-mapped index.
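
A bit-mapped index on a low-cardinality column can be sketched as one bit vector per distinct value, so that WHERE conditions become bitwise operations; the rows and region values below are hypothetical.

```python
# Illustrative bit-mapped index on a low-cardinality "region" column.

rows = ["North", "South", "North", "East", "South", "North"]

# One bit per row for each distinct value: bit i is 1 if row i has that value.
bitmaps = {}
for i, value in enumerate(rows):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << i

def matching_rows(bitmap):
    """Decode a bitmap back into row positions."""
    return [i for i in range(len(rows)) if bitmap >> i & 1]

# WHERE region = 'North' OR region = 'South' becomes a cheap bitwise OR.
combined = bitmaps["North"] | bitmaps["South"]
print(matching_rows(combined))   # [0, 1, 2, 4, 5]
```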

4. Load Balancing

With heavy query traffic in data warehouses, load balancing ensures optimal resource
utilization by:

 Distributing query workloads across multiple processors.


 Avoiding performance bottlenecks during peak usage.

5. Parallel Processing Options

These enhance speed by distributing tasks across multiple processors.

(a) Interquery Parallelization

Processes multiple queries concurrently but executes each serially.


Example:
Simultaneously running queries for sales in Europe and Asia.

(b) Intraquery Parallelization

Splits a single query into smaller operations (e.g., reading, joining, sorting) and executes
them in parallel.
Example:
A single query analyzing total sales across all regions is processed by dividing tasks like data
reads and aggregations.

(c) Horizontal Parallelism

Data is partitioned across disks, and tasks like reading are parallelized.
Example:
Reading customer data from servers in different regions.

(d) Vertical Parallelism

Different tasks (read, join, sort) are executed simultaneously in a pipeline.


Example:
While reading customer data, another task starts sorting it.

(e) Hybrid Method

Combines horizontal and vertical parallelism for optimal performance.


Example:
Amazon analyzes customer behavior by reading, joining, and sorting data in parallel across
regions and tasks.
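
A small sketch of the interquery and horizontal parallelism ideas above, using Python threads purely as stand-ins for a DBMS's parallel workers; the partitioned data and the "queries" are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioned sales data (one partition per region's disk).
partitions = {
    "Europe":   [100, 200, 300],
    "Asia":     [150, 250],
    "Americas": [400],
}

def scan_partition(region):
    """Horizontal parallelism: each worker reads only its own partition."""
    return sum(partitions[region])

def total_sales():
    """A single query split into parallel partition scans, then combined."""
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(scan_partition, partitions))

def top_region():
    return max(partitions, key=lambda r: sum(partitions[r]))

# Interquery parallelism: two independent queries running concurrently.
with ThreadPoolExecutor() as pool:
    q1 = pool.submit(total_sales)
    q2 = pool.submit(top_region)
    print(q1.result(), q2.result())
```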

6. Selection of DBMS

Choosing the right DBMS depends on features like:

 Query Governor: Prevents inefficient queries.


 Query Optimizer: Makes queries run faster.
 Load Utility: Supports bulk loading.
 Metadata Management: Maintains a catalog of data.
 Scalability and Portability: Supports large volumes of data and works across
platforms.

Application:
Amazon selects a DBMS with high scalability to handle millions of transactions daily and
with portability for deployment on multiple cloud platforms.

7. Collection of Tools

Data warehouse development requires several tools for:

 Data Acquisition: Extracting, transforming, and loading (ETL) tools.


 Data Storage: Managing the warehouse.
 Information Delivery: Dashboards, reports, alerts, and scorecards.

Application:
Amazon uses:

 ETL tools like Informatica to gather and clean data.


 OLAP tools to provide multidimensional views of sales data.
 Dashboards for sales and marketing teams to monitor KPIs.

1. Architecture First, Then Tools

 Principle:
Always design the architecture before selecting the tools. Architecture defines the
structure, functions, and services of the data warehouse components, ensuring they
meet business requirements. Tools should then be chosen to align with these defined
functions.
 Why This is Important:
If tools are chosen first, they may not align with the actual needs of the architecture.
For example, if a report generation tool is selected prematurely, it might fail to
support complex queries or visualization needs of power users, leaving the data
warehouse ineffective.

2. Types of Software Tools for Data Warehousing

Data warehousing tools are categorized based on the tasks they perform, such as data modeling,
extraction, transformation, loading, quality assurance, and analytics. Each tool type has distinct
features.

a. Data Modeling Tools

 Purpose: Create and maintain data models for source systems, staging areas, and the
data warehouse.
 Features:
o Forward engineering: Generate database schemas.
o Reverse engineering: Create models from existing databases.
o Support dimensional modeling like STAR schemas.

b. Data Extraction Tools

 Purpose: Extract data from source systems for loading into the warehouse.
 Extraction Methods:
o Bulk extraction: Full data refresh.
o Change-based replication: Incremental data loads.

 Factors Influencing Tool Selection:


o Source systems' platforms and databases.
o Built-in extraction capabilities of the source systems.

c. Data Transformation Tools

 Purpose: Convert extracted data into formats suitable for loading into the warehouse.
 Features:
o Splitting fields, consolidating data, standardization, and deduplication.
o Applying default values as needed.

d. Data Loading Tools

 Purpose: Load transformed data into the data warehouse.


 Features:
o Generate primary keys during loading.
o Use pre-coded procedures for efficiency.

e. Data Quality Tools

 Purpose: Detect and correct data errors to ensure accuracy.


 Features:
o Work on staging area or source systems.
o Resolve inconsistencies and prepare clean load images.

f. Query and Reporting Tools

 Purpose: Enable users to create and run queries and reports.


 Types:
o Report writers: Allow custom, graphic-intensive reports.
o Report servers: Centralized query handling.

g. Dashboards

 Purpose: Provide interactive, real-time visual insights.


 Features:
o Drill-down capabilities.

o Parameter customization and flexible data displays.

h. Scorecards

 Purpose: Track performance metrics against targets.


 Features:
o Display key performance indicators (KPIs).
o Compare current, past, and target performances.

i. Online Analytical Processing (OLAP) Tools

 Purpose: Support complex dimensional queries for analysis.


 Types:
o MOLAP: Uses proprietary multidimensional databases.
o ROLAP: Operates on relational databases of the warehouse.

j. Alert Systems

 Purpose: Notify users about critical events or exceptions.


 Types: Alerts from:
o Individual source systems.
o Enterprise-wide data warehouses.
o Specific data marts.

k. Middleware and Connectivity Tools

 Purpose: Ensure seamless communication between systems in heterogeneous


environments.
 Features:
o Cross-platform database access.
o Integration between warehouse components.

l. Data Warehouse Administration Tools

 Purpose: Simplify the management of data warehouses.


 Features:
o Monitor load histories and query usage.

o Automate maintenance tasks.

3. Data Warehouse Appliances

 Definition:
A DW appliance is an integrated device combining hardware, software, and storage,
designed specifically for data warehousing tasks.

a. Evolution of DW Appliances

 Initial setups involved piecemeal combinations of hardware, operating systems, and
DBMS meant for transactional systems, upgraded for warehousing.
 Modern appliances integrate components into cohesive systems, supporting high
scalability and parallel processing.

b. Benefits of DW Appliances

1. Cost Reduction:
o Lower initial setup and maintenance costs.
o Reduced support costs with single-vendor solutions.
2. Performance Improvement:
o Mixed workloads (queries, reports, analysis) run efficiently.
o Parallel processing reduces time for queries and reports.
3. High Availability:
o Built-in redundancies, such as backup networks and disk mirroring.
4. Reduced Administration:
o Automated processes minimize troubleshooting and maintenance.
5. Scalability:
o Modular designs allow easy addition of capacity and performance.
6. Reliability:
o Homogeneous systems eliminate integration issues.
7. Faster Implementation:
o Reduced need for testing and integration speeds up deployment.

CHAPTER 9: THE SIGNIFICANT ROLE OF METADATA

1. Understanding Metadata

What is Metadata?

Definition: Metadata is "data about data." It provides information about the data stored in the
data warehouse, such as its source, structure, format, updates, and how it is processed.

Why Metadata is Important

1. For Using the Data Warehouse


o Users (like business analysts) need to understand the available data, its
definitions, and its context to perform accurate analysis.
o Without metadata, users might misunderstand the data and draw incorrect
conclusions.
2. For Building the Data Warehouse

 Developers need metadata to map, extract, and transform data from source systems
into the data warehouse.
 Metadata defines the structure of both the source data and the target warehouse.

3. For Administering the Data Warehouse

 Metadata helps maintain, enhance, and expand the data warehouse by tracking data
sources, update cycles, and structural changes.
 It assists in monitoring query performance, handling storage upgrades, and disaster
recovery.

Metadata as the Nerve Center

Metadata acts as a communication hub that connects various components of the data
warehouse, such as:

1. Source Systems: Provides details about where data originates.


2. ETL (Extract, Transform, Load): Tracks how data is extracted, cleansed, and
transformed before loading into the warehouse.
3. Storage: Describes the logical and physical structure of the data warehouse.
4. User Access: Defines how users can query and retrieve data, ensuring security and
performance.

Why Metadata is Vital for End-Users

 Metadata provides users with tools to browse, search, and query the warehouse
effectively.
 It prevents misinterpretations by clarifying data definitions, relationships, and usage
rules.

Why Metadata is Vital for Developers

 Developers rely on metadata to build and maintain the warehouse. It provides details
about source-to-target mappings, transformation rules, and data refresh cycles.

Examples of Metadata Questions

To emphasize its practical value, consider these metadata-driven questions:

1. "Where did the data come from?"


Metadata identifies source systems, e.g., sales data pulled from both POS systems and
online stores.

2. "When was the data last updated?"


Metadata tracks refresh schedules, e.g., the warehouse updated sales figures last night
at 11 PM.
3. "What fields are included in this dataset?"
Metadata lists data attributes, e.g., the "Customer" table contains fields like Name,
Email, Purchase History, and Loyalty Status.
4. "How is data quality ensured?"
Metadata records cleansing rules and deduplication schedules, e.g., duplicate
customer records removed weekly.
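
The four metadata questions above can be answered from a small catalog structure such as the one sketched below; the table name and entries are hypothetical examples, not real warehouse metadata.

```python
# Illustrative metadata catalog keyed by warehouse table name.

catalog = {
    "sales_fact": {
        "sources": ["POS system", "online store"],           # where did it come from?
        "last_refreshed": "2024-06-01T23:00:00",              # when was it updated?
        "fields": ["sale_id", "customer_id", "amount"],       # what fields exist?
        "quality_rules": ["deduplicate weekly", "reject negative amounts"],
    },
}

def describe(table):
    meta = catalog[table]
    return (f"{table}: sourced from {', '.join(meta['sources'])}; "
            f"last refreshed {meta['last_refreshed']}; "
            f"fields: {', '.join(meta['fields'])}")

print(describe("sales_fact"))
```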

Architectural Framework of Data Warehousing

The architectural framework provides a blueprint of how the data warehouse components
interact to support business requirements. It is a high-level design that outlines the overall
structure, guiding principles, and data flow within the warehouse.

Key Components

1. Source Systems:
These are operational systems like CRM, ERP, or transactional databases that feed
data into the warehouse.
Example: Sales data from a POS system or customer data from a CRM platform.
2. ETL Processes:
ETL (Extract, Transform, Load) tools extract data from source systems, transform it
into a usable format, and load it into the data warehouse.
Example: Extracting product sales from multiple stores, converting currencies into a
single standard, and loading them into the warehouse.
3. Data Warehouse Storage:
This is the central repository where cleaned, integrated, and consolidated data is
stored for analysis.
Example: A database that stores daily sales totals by product, store, and region.
4. Data Marts:
These are smaller, subject-specific databases designed for specific business areas like
sales, marketing, or finance.

Example: A marketing data mart focusing on campaign performance and customer
demographics.
5. Front-End Tools:
These tools enable users to interact with the data warehouse through queries,
dashboards, and reports.
Example: A business analyst uses a BI tool like Tableau to create a report comparing
monthly sales to targets.

Technical Architecture of Data Warehousing

Technical architecture focuses on the tools, technologies, and infrastructure that support
the implementation and operation of the data warehouse. It addresses how data is processed,
stored, and accessed at a technical level.

Key Layers

1. Data Source Layer:


The origin of raw data. It could include databases, flat files, APIs, or streaming data.
Example: Data from a retailer's online store and physical stores.
2. Integration Layer:
The ETL tools and processes for extracting, transforming, and loading data. This layer
includes data cleansing, validation, and integration.
Example: An ETL tool like Informatica handles merging data from multiple sales
channels.

3. Storage Layer:
Includes the data warehouse database, often implemented using relational databases
(e.g., Oracle, Snowflake) or cloud platforms (e.g., AWS Redshift).
Example: A Snowflake cloud-based warehouse storing sales data partitioned by year.
4. Processing Layer:
Analytical processing is performed using tools like OLAP (Online Analytical
Processing).
Example: Precomputing aggregated sales data for faster query performance.
5. Access Layer:
Enables end-users to access data through BI tools, dashboards, and APIs.
Example: Analysts use Power BI to create revenue trend visualizations.

Key Features:

 Scalability: Supports increasing data volumes and complexity.


 Security: Ensures data protection through role-based access and encryption.
 Performance: Optimizes query response times using indexing, partitioning, and in-
memory processing.

Metadata and Functional Areas in a Data Warehouse:

1. Data Acquisition

Processes:
This area involves acquiring data from various sources and preparing it for storage in the data
warehouse. Key processes include:

 Data Extraction: Pulling raw data from diverse source systems.


 Data Transformation: Converting data into a usable format.
 Data Cleansing: Removing inconsistencies or inaccuracies.
 Data Integration: Combining data from different sources.
 Data Staging: Temporarily storing processed data before final loading.

Metadata Types:
The metadata recorded here provides detailed information about the source systems and
transformation rules. Examples include:

 Source system platforms and models: Logical and physical structures of data
sources.
 Data extraction methods: Techniques used to extract data.
 Transformation and cleansing rules: Definitions for modifying and cleaning data.
 Summarization rules: Rules for creating aggregated data like totals or averages.
 Source-to-target mappings: Relationships between source and target systems.

2. Data Storage

Processes:
Once data is acquired, it is loaded into the data warehouse and managed for long-term use.
Processes include:

 Data Loading: Transferring data into the warehouse.


 Data Archiving: Storing older data for historical reference.
 Data Management: Ensuring efficient storage, backup, and retrieval.

Metadata Types:
Metadata in this area relates to data loading and storage management, such as:

 Source and target logical/physical models: Representation of data structures.


 Source-target mapping: Defines how data from sources is mapped to targets.
 Load procedures and statistics: Information on initial and incremental loads.
 Backup/recovery procedures: Guidelines for data recovery during failures.
 Storage allocations and archival rules: Allocation of storage resources and rules for
archiving data.

3. Information Delivery

Processes:
This area is focused on making data accessible to end-users through queries, reports, and
analytical tools. Processes include:

 Report Generation: Creating static and dynamic reports for analysis.


 Query Processing: Enabling users to extract specific data.
 Complex Analysis: Supporting advanced analytics like trend analysis or forecasting.

Metadata Types:
Metadata here supports reporting and analytics, including:

 Predefined queries and reports: Templates for common queries.

 OLAP content: Structures and hierarchies for online analytical processing.


 Target data definitions in business terms: Simplified descriptions for end-users.
 Query templates and navigation methods: Guidelines for creating queries.

4. Business Metadata vs. Technical Metadata


 Business Metadata

Purpose:
Business metadata is designed to help end-users (e.g., managers, analysts) understand and
utilize the data warehouse by expressing data in meaningful, non-technical terms.

Key Characteristics:

1. Less structured compared to technical metadata.


2. Captures informal data from textual documents, spreadsheets, and business rules.
3. Provides business definitions and context to make data accessible and usable.
4. Expresses predefined queries, reports, data transformation rules, and data sources in
plain language.

Examples of Business Metadata:

 Connectivity procedures and access privileges.


 Data definitions (tables, columns) in business terms.

 Predefined queries and reports.


 Data transformation rules and mappings.
 Data ownership details.
 Report distribution information.
 Rules for OLAP analysis and data currency.

Questions Addressed by Business Metadata:

 How can I connect to the data warehouse?


 Which data warehouse parts can I access?
 What are the definitions of attributes in my query?
 Are there predefined reports or queries for my requirements?
 What is the source of specific data items?
 How current is the OLAP data?
 How is data derived or aggregated?

Who Benefits:

 Managers, analysts, power users, casual users, and executives use business metadata
for efficient decision-making and analysis.

 Technical Metadata

Purpose:
Technical metadata supports IT professionals (e.g., developers, administrators) in building,
maintaining, and administering the data warehouse.

Key Characteristics:

1. Structured and detailed information about processes and data structures.


2. Supports initial development, ongoing growth, and administration of the warehouse.
3. Includes rules for extraction, transformation, and loading (ETL) processes.
4. Acts as a comprehensive guide for IT tasks like backups, monitoring, and version
control.

Examples of Technical Metadata:

 Data models of source systems and mappings (source → staging → data warehouse).
 Data extraction, transformation, and cleansing rules.
 Data loading schedules and controls.
 Physical and logical database designs.
 Summarizations, aggregations, and derivation rules.
 Authority/access privileges.
 Connectivity information (e.g., network/server details).
 Purge and archival rules.

Questions Addressed by Technical Metadata:

 What are the databases and tables available?


 What columns exist in each table? What are their keys and indexes?
 What are the data extraction rules and schedules?
 What are the mappings and transformations for source-to-target data?
 What default values were applied during data cleansing?
 When was the last update for the data items?
 What are the schedules for data refresh, OLAP updates, or data archiving?
 Which query/reporting tools are available for use?

Who Benefits:

 IT staff like data architects, ETL developers, database administrators, and system
administrators rely on technical metadata for effective implementation and
management.

1. Project Manager: Helps plan and monitor the project, understanding data
dependencies.
2. Data Warehouse Administrator: Ensures smooth operations by maintaining
metadata records.
3. Database Administrator: Manages and optimizes database performance using
metadata.
4. Metadata Manager: Oversees metadata collection and integration.

5. Data Warehouse Architect: Uses metadata to design efficient data warehouse
structures.
6. Data Acquisition Developer: Extracts and integrates data from source systems using
metadata.
7. Data Quality Analyst: Checks data consistency and accuracy based on metadata
rules.
8. Business Analyst: Uses metadata to understand data lineage and create meaningful
reports.
9. System Administrator: Manages system configurations and interactions.
10. Infrastructure Specialist: Plans hardware and software infrastructure using
metadata.
11. Data Modeler: Designs logical and physical data models with metadata insights.
12. Security Architect: Implements data security based on metadata structure.

How to Provide Metadata

Metadata in a data warehouse is essential for understanding and managing the system. Here's
how it's provided:

1. Capturing Metadata: Tools record metadata during processes like extraction,
transformation, and loading (ETL).
2. Standardization: Metadata must be consistent across all tools and processes.
3. Integration: Metadata from various tools (e.g., ETL, query tools) should be unified.
4. Synchronization: Updates or changes in metadata should reflect throughout the
system.
5. User Accessibility: Metadata should be presented in an easy-to-understand format for
end-users.

Challenges in Metadata Management

Metadata management is crucial but challenging due to the following reasons:

1. Proprietary Formats: Each tool uses its own metadata format, making integration
difficult.
2. Lack of Standards: No universally accepted metadata standards exist.
3. Centralized vs. Decentralized: Debate over a single repository versus multiple
fragmented stores.
4. Version Control: Keeping metadata consistent over time is tedious.
5. Conflicting Definitions: Different source systems may use varying naming
conventions, definitions, and rules.

Metadata Requirements

To be effective, metadata should:

1. Be Time-Variant: Track historical changes in data structures and rules.


2. Integrate Sources: Collect metadata from operational systems, ETL tools, and user
interfaces.
3. Ensure Synchronization: Align metadata across all tools and processes.
4. Facilitate Exchange: Allow easy sharing of metadata between tools.
5. Be User-Friendly: Present metadata in simple formats for non-technical users.

Sources of Metadata

Metadata originates from multiple points in the data warehouse lifecycle, such as:

1. Source Systems: Data models, file layouts, and documentation from operational
systems.
2. Data Extraction: Details on selected fields, extraction schedules, and methods.
3. Data Transformation: Rules for converting, validating, and auditing data.
4. Data Loading: Specifications for loading data into the warehouse, including audit
trails and schedules.
5. Data Storage: Metadata on the structure and grouping of tables in the warehouse.
6. Information Delivery: Metadata for reporting tools, predefined queries, and OLAP
data models.

Metadata Repository

A metadata repository serves as a comprehensive catalog or directory to classify, store, and
manage metadata. It acts as the backbone of a data warehouse, enabling effective
communication and access to data and metadata by various users (business users,
administrators, and developers).

Types of Metadata

Metadata can be broadly classified into:

1. Business Metadata:
o Focused on end-users, helping them understand the data in terms of business
context.
o Examples: Definitions of warehouse tables and columns in business language,
predefined reports, data load schedules, and user access permissions.
2. Technical Metadata:
o Needed by developers and administrators to manage, maintain, and optimize
the data warehouse.
o Examples: Source system models, transformation rules, data cleansing and
loading procedures, backup/recovery configurations.
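
A repository that keeps business and technical metadata side by side could be sketched as follows; the classes, entries, and search behavior are assumptions for illustration, not a description of any specific metadata product.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    name: str
    kind: str          # "business" or "technical"
    description: str
    details: dict = field(default_factory=dict)

class MetadataRepository:
    """Tiny catalog that classifies, stores, and searches metadata entries."""
    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def search(self, kind=None, text=""):
        return [e for e in self._entries
                if (kind is None or e.kind == kind)
                and text.lower() in e.description.lower()]

repo = MetadataRepository()
repo.add(MetadataEntry("monthly_sales_report", "business",
                       "Predefined report comparing monthly sales to targets"))
repo.add(MetadataEntry("sales_fact.load_job", "technical",
                       "Nightly incremental load of the sales fact table",
                       {"schedule": "02:00", "source": "order management system"}))

print([e.name for e in repo.search(kind="business", text="sales")])
```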

Information Navigator

An information navigator bridges the gap between business and technical metadata. It helps
users locate and interact with metadata effectively. Some key functions include:

1. Interface with Query Tools: Allows users to access metadata definitions via third-
party query tools.
2. Drill Down for Details: Enables navigating from high-level summaries (e.g., table
definitions) to detailed metadata (e.g., attribute specifications).
3. Review Queries and Reports: Users can browse predefined queries and reports,
selecting and launching them with desired parameters.

Requirements for a Metadata Repository

To be effective, a metadata repository should meet these criteria:

1. Flexible Organization: Support logical classification of metadata for easy access.


2. Historical Perspective: Enable versioning for tracking changes in metadata.
3. Integration: Store business and technical metadata in a unified format for different
user needs.
4. Compartmentalization: Separate logical and physical database models.
5. Analysis Capabilities: Allow users to browse and navigate relationships within
metadata.
6. Customization: Provide tailored views and support for new metadata objects.
7. Standardized Naming: Ensure consistency in metadata naming conventions.
8. Synchronization: Keep metadata consistent across all systems and tools.
9. Openness: Enable interoperability with tools using industry-standard interfaces.

Metadata Integration and Standards

Standardization is essential for seamless metadata exchange across tools and processes in a
data warehouse. Two widely accepted standards include:

1. Common Warehouse Metamodel (CWM): Provides a framework for managing
metadata, including transformations and OLAP-specific metadata types.
2. Open Information Model (OIM): Focuses on database schema management and
data lineage.

These standards enable:

 Schema management: Consistent handling of source and target schemas.


 Transformation details: Capturing the specifics of data extraction, transformation,
and loading.
 Data lineage tracking: Tracing data flow from source to warehouse.

Implementation Options

Organizations can implement metadata management in various ways:

1. Centralized Repository: A single repository accessible to all tools. While it promotes
sharing, managing a centralized system in large-scale environments can be
challenging.
2. Decentralized Approach: Metadata is distributed across systems, allowing autonomy
but requiring robust integration strategies.
3. Custom Solutions: Organizations create their own databases or procedures to manage
metadata.
4. Integration with Tools: Use query tools and web interfaces to display metadata
alongside actual data.
5. Web-Based Access: Leverage intranet-based metadata reporting for user-friendly
navigation and reporting.

Key Questions for Metadata Initiatives

When designing metadata management, address the following:

1. Goals: What is the purpose of metadata in your organization? (E.g., ease of access,
data lineage tracking).

2. Sources: Where will metadata come from? (E.g., source systems, staging areas).
3. Maintenance: Who will update and manage metadata?
4. Usage: How will metadata be used, and by whom? (E.g., business users for reports,
developers for optimization).
5. Tools: What tools are needed for effective metadata management?

Metadata Access via Web

Modern data warehouses increasingly rely on web-based tools for metadata navigation. Using
intranets, business users can:

1. Browse warehouse metadata through web browsers.


2. Navigate through content like data marts and OLAP cubes.
3. Generate reports and analyze data interactively.
