Data Warehouse
A Data Warehouse (DW) is a centralized repository that stores integrated, historical, and
current data from multiple sources for business intelligence (BI), reporting, and analysis.
Unlike traditional transactional databases, data warehouses are optimized for online analytical
processing (OLAP) rather than online transaction processing (OLTP).
A data warehouse typically supports star schema, snowflake schema, or hybrid models for efficient querying.
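As an illustration, here is a minimal sketch of a star schema built from pandas DataFrames; all table and column names below are hypothetical. A central fact table holds measures and foreign keys, while small dimension tables hold descriptive attributes.

import pandas as pd

# Hypothetical dimension tables: descriptive attributes, one row per entity.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Laptop", "Phone"],
    "category": ["Electronics", "Electronics"],
})
dim_store = pd.DataFrame({
    "store_id": [10, 20],
    "store_name": ["Online", "Downtown"],
    "channel": ["online", "offline"],
})

# Hypothetical fact table: measures plus foreign keys into the dimensions.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id": [10, 10, 20],
    "units_sold": [3, 5, 2],
    "revenue": [3000.0, 2500.0, 2000.0],
})

# Analytical queries join the fact table to its dimensions, then aggregate.
sales = (
    fact_sales
    .merge(dim_product, on="product_id")
    .merge(dim_store, on="store_id")
)
revenue_by_channel = sales.groupby("channel")["revenue"].sum()
print(revenue_by_channel)

A snowflake schema would further normalize the dimensions (e.g., splitting category out of dim_product into its own table) at the cost of extra joins.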
4. Metadata & Management Layer
Stores metadata about:
o Data sources, relationships, and transformations.
1. Data Integration
Definition:
Data integration is the process of combining data from multiple heterogeneous sources into a
single, unified view for analysis and reporting.
Key Steps in Data Integration:
1. Data Extraction:
o Collecting raw data from the source systems (databases, files, APIs).
2. Data Cleaning:
o Detecting and correcting errors and inconsistencies in the extracted data.
3. Data Transformation:
o Converting data formats, standardizing units, and applying business rules.
4. Data Loading:
o Storing the processed data in a data warehouse, data lake, or database.
Example:
A retail company integrates sales data from online and offline stores into a single data
warehouse to analyze total revenue.
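A minimal pandas sketch of this integration; the two DataFrames below are stand-ins for extracts from hypothetical online and offline source systems, and all column names are assumptions.

import pandas as pd

# Stand-ins for source extracts; in practice these would come from
# pd.read_csv / pd.read_sql against the two systems.
online = pd.DataFrame({
    "order_id": [1, 2], "date": ["2024-01-05", "2024-01-06"], "amount": [120.0, 80.0],
})
offline = pd.DataFrame({
    "receipt_no": [501, 502], "date": ["2024-01-05", "2024-01-07"], "amount": [60.0, 45.0],
})

# Standardize the schemas so the two sources line up on the same columns.
online = online.rename(columns={"order_id": "sale_id"}).assign(channel="online")
offline = offline.rename(columns={"receipt_no": "sale_id"}).assign(channel="offline")

# Integrate into one unified table; the loading step would follow here,
# e.g. sales.to_parquet("warehouse/sales.parquet").
sales = pd.concat([online, offline], ignore_index=True)
print("Total revenue:", sales["amount"].sum())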
Tools for Data Integration:
✔ Apache NiFi
✔ Talend
✔ Informatica
✔ Microsoft SSIS
✔ AWS Glue
2. Data Transformation
Definition:
Data transformation is the process of converting raw data into a meaningful and usable
format. It includes data cleansing, standardization, filtering, aggregation, and
enrichment before storing it in a target system.
Key Types of Data Transformation:
1. Format Conversion:
o Changing data types (e.g., String → Integer, JSON → CSV).
2. Data Normalization:
o Standardizing values (e.g., converting all dates to YYYY-MM-DD format).
3. Data Aggregation:
o Summarizing data (e.g., calculating total sales per month).
4. Data Deduplication:
o Removing redundant records to avoid inconsistencies.
5. Data Filtering:
o Removing irrelevant or incomplete data.
6. Data Enrichment:
o Adding missing information by merging external data sources.
Example:
A banking system transforms customer transaction data by converting all currency values
into USD, filtering invalid transactions, and aggregating total spending per customer.
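A minimal pandas sketch of these three steps; the column names and exchange rates below are illustrative assumptions, not real market data.

import pandas as pd

# Hypothetical transaction data with mixed currencies.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "amount": [100.0, -50.0, 200.0, 80.0],
    "currency": ["EUR", "USD", "USD", "GBP"],
})
fx_to_usd = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}  # illustrative rates

# Format conversion: express every amount in USD.
tx["amount_usd"] = tx["amount"] * tx["currency"].map(fx_to_usd)

# Filtering: drop invalid (non-positive) transactions.
tx = tx[tx["amount_usd"] > 0]

# Aggregation: total spending per customer.
spend = tx.groupby("customer_id")["amount_usd"].sum().reset_index()
print(spend)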
Tools for Data Transformation:
✔ Apache Spark
✔ Pandas (Python)
✔ SQL-based transformations
✔ dbt (Data Build Tool)
Data Cleaning
Definition:
Data cleaning (also known as data cleansing or data scrubbing) is the process of detecting,
correcting, and removing errors, inconsistencies, and inaccuracies in a dataset. The goal
is to ensure that the data is accurate, complete, reliable, and ready for analysis.
4. Removing Outliers
Issue: Extreme values can distort analysis (e.g., salary dataset with a value of
$1,000,000,000).
Solutions:
o Use statistical methods (e.g., Z-score, IQR method) to detect outliers; see the sketch below.
o Remove or cap the detected values, depending on the analysis.
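A minimal pandas sketch of the IQR method on a hypothetical salary column; the 1.5 × IQR fence used here is the usual rule of thumb.

import pandas as pd

df = pd.DataFrame({"salary": [45_000, 52_000, 61_000, 58_000, 1_000_000_000]})

# Compute the interquartile range and the standard 1.5 * IQR fences.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the fences; the extreme salary is dropped.
cleaned = df[df["salary"].between(lower, upper)]
print(cleaned)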
7. Data Validation
Issue: Invalid or incorrect data entries (e.g., negative age values).
Solution: Apply validation rules and constraints (e.g., age must be between 0 and 120).
Tools: SQL (CHECK constraints), Pandas validation functions.
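A minimal pandas sketch of the age rule; a SQL CHECK constraint such as CHECK (age BETWEEN 0 AND 120) would enforce the same rule at the database level. The table below is hypothetical.

import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, -5, 130]})

# Flag rows that violate the 0-120 rule instead of silently dropping them.
valid_mask = df["age"].between(0, 120)
invalid_rows = df[~valid_mask]  # route these to review or quarantine
df = df[valid_mask]
print(df)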
What is OLAP?
Online Analytical Processing (OLAP) is a technology that allows users to perform complex
analytical queries on large datasets efficiently. It is used for business intelligence,
reporting, and decision-making by organizing data into a multidimensional format for
faster analysis.
Key Features of OLAP:
✔ Stores historical data for trend analysis.
✔ Supports multi-dimensional data models.
✔ Enables fast query performance using pre-aggregated data.
✔ Used in data warehousing and business intelligence.
OLAP vs. OLTP (query type): OLAP handles complex analytical queries (JOINs, GROUP BY), while OLTP handles simple read/write queries.
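As a rough illustration of the multidimensional idea, here is a pandas pivot table over a hypothetical sales dataset; margins=True adds the kind of pre-aggregated roll-up totals an OLAP cube serves.

import pandas as pd

# Hypothetical sales cube: two dimensions (region, month) and one measure.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "revenue": [100, 120, 90, 140],
})

# Pivot into a multidimensional view with pre-aggregated row/column totals.
cube = pd.pivot_table(
    sales, values="revenue", index="region", columns="month",
    aggfunc="sum", margins=True, margins_name="Total",
)
print(cube)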