Data Repositories in Data Analytics

Data repositories

Uploaded by

sandhiyarajeswari.k

Data Repositories in Data Analytics

Data repositories are centralized systems used to store, manage, and retrieve data for
analytical purposes. These repositories serve as the foundation for data-driven decision-making
in organizations by ensuring easy access to well-organized data. With the growing importance of
data analytics, repositories have evolved to accommodate different types of data, from structured
transactional records to unstructured multimedia content.

Data repositories are vital for businesses, government institutions, and research organizations, as
they enable secure and scalable management of ever-growing datasets. These systems are
designed to ensure data integrity, accessibility, and interoperability while supporting diverse
analytical processes like business intelligence, machine learning, and big data analysis.

Why do we need a Data Repository?

• It is vital to organize and analyze data that comes from different sources.
• To pinpoint trends, you need to assess several years of historical data.
• Restructuring data and
• The information stored in a data repository is more useful to business users because it has already been cleaned and optimized before they access it.
• Data repositories ensure that everyone in the company works with a single version of the truth, i.e., the same data.

Challenges associated with Data Repository

• It is vital to make sure that the database management system can scale as the data expands, because growth in the datasets can reduce the system's speed.
• It is best to maintain a backup of all your databases, because a system crash can negatively impact your data.
• Because the data is stored in a single location, unauthorized operators who gain access may be able to reach sensitive data. Spreading data across multiple storage locations is not a simple remedy, since implementing security protocols on multiple locations is very challenging.

Types of Data Repositories

1. Data Warehouses

• Description: A data warehouse is a centralized repository optimized for storing historical, structured data from multiple sources. It integrates data from various operational systems into a unified, consistent format and is designed to support business intelligence (BI) activities such as reporting, querying, and data analysis, enabling organizations to make data-driven decisions.

Key Features of a Data Warehouse:

1. Subject-Oriented:
o Organized around major business subjects (e.g., sales, customers, finance), rather than applications or processes.

2. Integrated:
o Data from various sources (databases, flat files, applications) is consolidated and standardized to ensure consistency.

3. Time-Variant:
o Stores historical data to enable trend analysis and long-term reporting. Data is often timestamped.

4. Non-Volatile:
o Once data is loaded into the warehouse, it is read-only and not modified. Updates occur through periodic data refreshes.

5. Optimized for Querying and Analysis:
o Structured to handle complex queries efficiently, rather than transactional processing.

Data Warehouse Architecture:

1. Source Systems:
o Includes operational databases (e.g., CRM, ERP) and external data sources.

2. ETL Process (Extract, Transform, Load):

o Extract: Data is collected from various source systems.
o Transform: Data is cleaned, validated, and standardized to ensure consistency.
o Load: Transformed data is loaded into the data warehouse.

3. Data Warehouse:
o Centralized database optimized for storage and retrieval.

4. Data Marts:
o Subsets of the data warehouse that focus on specific business areas (e.g.,
marketing, sales).

5. BI Tools:
o Tools like dashboards, reports, and visualization software enable users to analyze
data.
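
The ETL stage of this architecture can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: an in-memory SQLite database stands in for the warehouse, and the source rows, the `sales` table, and the cleaning rules are all assumptions made for the example rather than part of any particular product.

```python
import sqlite3

# Hypothetical rows as they might arrive from two source systems.
crm_rows = [
    {"customer": " Alice ", "region": "north", "amount": "120.50"},
    {"customer": "Bob", "region": "SOUTH", "amount": "80"},
]
erp_rows = [
    {"customer": "Carol", "region": "North", "amount": "not-a-number"},
    {"customer": "Dave", "region": "east", "amount": "42.00"},
]


def extract():
    """Extract: collect rows from every source system."""
    return crm_rows + erp_rows


def transform(rows):
    """Transform: clean, validate, and standardize each row."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        cleaned.append(
            (row["customer"].strip(), row["region"].strip().title(), amount)
        )
    return cleaned


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT customer, region, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

In this sketch the row with an unparseable amount is discarded during the transform step, and region names are standardized to one spelling before loading, which is the kind of consistency the ETL process is meant to guarantee.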

Benefits of a Data Warehouse:

1. Centralized Data Access:

o Combines data from disparate systems into a single, unified platform.

2. Improved Data Quality:

o Data is cleaned and standardized during the ETL process.

3. Historical Analysis:
o Provides access to historical data for trend analysis and forecasting.
4. Efficient Query Performance:
o Designed to handle complex queries and large datasets quickly.

5. Supports Decision-Making:
o Enables organizations to generate insights and make informed decisions.

Example of a Data Warehouse:

An e-commerce company might use a data warehouse to combine:

• Sales data from the online store.
• Customer data from a CRM system.
• Inventory data from the supply chain.

This enables reporting on metrics like:

• Sales trends by region or product category.
• Customer purchase behavior.
• Inventory levels over time.
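
As a rough sketch of how such a report might be produced, the following Python snippet joins sales, customer, and product tables and totals sales by region and product category. The table names, columns, and sample rows are hypothetical, and SQLite is used only because it ships with Python; a real warehouse would run a similar query at much larger scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales (customer_id INTEGER, product_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'North'), (2, 'South');
INSERT INTO products VALUES (10, 'Electronics'), (11, 'Apparel');
INSERT INTO sales VALUES (1, 10, 200.0), (1, 11, 50.0), (2, 10, 300.0), (2, 10, 100.0);
""")

# Combine the three sources and aggregate sales per region and category.
report_sql = """
SELECT c.region, p.category, SUM(s.amount) AS total_sales
FROM sales AS s
JOIN customers AS c ON c.customer_id = s.customer_id
JOIN products AS p ON p.product_id = s.product_id
GROUP BY c.region, p.category
ORDER BY c.region, p.category
"""
sales_by_region_category = conn.execute(report_sql).fetchall()
for region, category, total in sales_by_region_category:
    print(region, category, total)
```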

Popular Data Warehouse Tools:

• On-Premise: Oracle, Teradata, IBM Db2, Microsoft SQL Server
• Cloud-Based: Amazon Redshift, Google BigQuery, Snowflake, Microsoft Azure Synapse

2. Data Cube

A data cube is a multi-dimensional data structure used in data warehousing and online
analytical processing (OLAP) to represent data for analysis. It organizes data into dimensions
and measures, allowing for efficient querying and analysis across multiple perspectives.

Key Concepts of a Data Cube:

1. Dimensions:
o These are the perspectives or categories by which data can be analyzed. Examples
include time, location, product, customer, etc.
o Dimensions form the axes of the cube.

2. Measures:
o These are the numerical values or facts to be analyzed, such as sales, profit,
revenue, quantity, etc.
o Measures are typically aggregated (e.g., sum, average) across dimensions.

3. Multidimensional Structure:
o A data cube is like a 3D spreadsheet where each cell contains aggregated data
corresponding to specific dimension values.
o For example, a cube with dimensions time (years), location (regions), and
product might show the total sales for "Product X" in "Region Y" during "Year
Z."

Example of a Data Cube:

Dimensions:

• Time: Years (2022, 2023, 2024)
• Location: Regions (North, South, East, West)
• Product: Categories (Electronics, Apparel, Furniture)

Measures:

• Sales Revenue

A single cell in the cube could represent:

• Total sales revenue for Electronics in the North region during 2023.
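
One simple way to picture such a cube in code is a dictionary keyed by one value per dimension, with the aggregated measure as the value. The sketch below is only an illustration in plain Python; the fact records and their revenue figures are made up for the example, and real OLAP engines use far more elaborate storage.

```python
from collections import defaultdict

# Hypothetical fact records: (year, region, category, revenue)
facts = [
    (2023, "North", "Electronics", 1200.0),
    (2023, "North", "Electronics", 800.0),
    (2023, "South", "Apparel", 300.0),
    (2024, "North", "Furniture", 450.0),
]

# Each cell is addressed by (time, location, product) and holds summed revenue.
cube = defaultdict(float)
for year, region, category, revenue in facts:
    cube[(year, region, category)] += revenue

# The single cell described above: Electronics, North region, 2023.
cell = cube[(2023, "North", "Electronics")]
print(cell)
```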

Operations on a Data Cube:

1. Slice: Extracts a 2D view of the cube by fixing one dimension to a single value (e.g., fixing time to 2023 to see sales by region and product for that year).
2. Dice: Extracts a smaller sub-cube by selecting specific values for multiple dimensions
(e.g., sales of Electronics and Furniture in North and South regions for 2023).
3. Roll-up: Aggregates data by climbing up a dimension hierarchy (e.g., aggregating daily
sales into monthly sales).
4. Drill-down: Breaks down aggregated data into finer levels (e.g., from yearly sales to
monthly sales).
5. Pivot: Rotates the cube to view data from a different perspective.
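
The operations above can be sketched on a small dictionary-based cube. This is a hedged illustration, not how an OLAP engine is implemented: the cube contents and all function names are hypothetical, and the roll-up shown here aggregates the region dimension all the way up to "all regions" rather than climbing a day-to-month hierarchy. Drill-down is the reverse step, which in this sketch simply means going back to the finer-grained cube.

```python
from collections import defaultdict

# Hypothetical cube: (year, region, category) -> sales revenue
cube = {
    (2022, "North", "Electronics"): 100.0,
    (2023, "North", "Electronics"): 150.0,
    (2023, "North", "Furniture"): 60.0,
    (2023, "South", "Electronics"): 90.0,
    (2023, "South", "Apparel"): 40.0,
    (2023, "East", "Apparel"): 70.0,
}


def slice_year(cube, year):
    """Slice: fix the time dimension, leaving a 2D region x category view."""
    return {(r, c): v for (y, r, c), v in cube.items() if y == year}


def dice(cube, years, regions, categories):
    """Dice: keep a sub-cube restricted on several dimensions at once."""
    return {
        k: v for k, v in cube.items()
        if k[0] in years and k[1] in regions and k[2] in categories
    }


def roll_up_regions(cube):
    """Roll-up: aggregate the region dimension up to 'all regions'."""
    totals = defaultdict(float)
    for (y, _region, c), v in cube.items():
        totals[(y, c)] += v
    return dict(totals)


def pivot(view):
    """Pivot: swap the two axes of a 2D view."""
    return {(b, a): v for (a, b), v in view.items()}


sales_2023 = slice_year(cube, 2023)
sub_cube = dice(cube, {2023}, {"North", "South"}, {"Electronics", "Furniture"})
by_year_category = roll_up_regions(cube)
pivoted = pivot(sales_2023)
```

The dice call mirrors the example in the list: Electronics and Furniture in the North and South regions for 2023.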

3. Data Marts

Description: Data marts are subsets of data warehouses designed for specific business units or departments. Each one focuses on a specific domain and contains a streamlined, targeted dataset tailored to the needs of particular users or groups, which makes access to relevant data and analysis for those purposes faster and more efficient.

Characteristics of a Data Mart:

o Subject-Oriented: Each data mart is built around a single subject or business area, such as sales, marketing, finance, or customer service.

o Smaller Scope: Unlike a data warehouse, which stores enterprise-wide data, a data mart contains a smaller, more focused dataset.

o Optimized for Specific Use: Designed to meet the analytical needs of specific users or departments.

o Derived from a Data Warehouse: Typically a subset of the data held in a central data warehouse.

o Improved Performance: Since they store only relevant data, queries on a data mart are faster compared to a full data warehouse.

Types of Data Marts:

o Dependent Data Mart: Created from a central data warehouse. Relies on the enterprise data warehouse (EDW) for its data. Ensures consistency and integration across the organization.

o Independent Data Mart: Built directly from operational systems or other sources. Does not rely on a central data warehouse. Used when no enterprise data warehouse exists.

o Hybrid Data Mart: Combines data from both a central data warehouse and other sources. Useful when additional data not in the warehouse is required for analysis.

Example:

Consider an organization with an enterprise data warehouse containing company-wide data.

Sales Data Mart:

• Subject: Sales
• Data Included: Product sales, sales trends, regional sales, sales team performance
• Users: Sales managers, analysts

Marketing Data Mart:

• Subject: Marketing
• Data Included: Campaign performance, lead conversion rates, advertising expenses
• Users: Marketing team
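
A dependent data mart of this kind can be approximated as a filtered view over the warehouse. The sketch below is an assumption-laden illustration: the `warehouse_facts` table, its columns, and the sample rows are hypothetical, and in practice a mart is often a physically separate store rather than a view in the same database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_facts (subject TEXT, metric TEXT, region TEXT, value REAL);
INSERT INTO warehouse_facts VALUES
    ('sales', 'product_sales', 'North', 500.0),
    ('sales', 'product_sales', 'South', 350.0),
    ('marketing', 'ad_spend', 'North', 120.0),
    ('marketing', 'leads_converted', 'South', 30.0);

-- Each mart exposes only the rows for its own subject.
CREATE VIEW sales_mart AS
    SELECT metric, region, value FROM warehouse_facts WHERE subject = 'sales';
CREATE VIEW marketing_mart AS
    SELECT metric, region, value FROM warehouse_facts WHERE subject = 'marketing';
""")

# Sales managers query only the sales mart.
sales_rows = conn.execute(
    "SELECT region, value FROM sales_mart ORDER BY region"
).fetchall()

# The marketing team sees only marketing metrics.
marketing_metrics = [
    row[0] for row in conn.execute("SELECT metric FROM marketing_mart ORDER BY metric")
]
print(sales_rows, marketing_metrics)
```

Because both views read from the same warehouse table, the two departments stay consistent with each other, which is the main appeal of the dependent design.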

Benefits of Data Marts:

1. Faster Access to Data: Since data marts are smaller and focused, users can retrieve data
more quickly.
2. Tailored Data Analysis: Specific to the needs of a department or group.
3. Cost-Effective: Easier and cheaper to implement than a full enterprise data warehouse.
4. Decentralized Control: Departments can have greater control over their data.
5. Improved Performance: Queries run faster on smaller, optimized datasets.

4. Data Lakes

• Description: Data lakes are large storage systems designed to handle raw, unstructured, or semi-structured data, and are often used in big data and machine learning projects. Unlike a data warehouse, which organizes data into a predefined schema, a data lake stores data in its native format and follows a flexible schema-on-read model, so data can be stored without transformation during ingestion and structured later, when it is analyzed.

Key Features of a Data Lake:

1. Raw Data Storage:

o Data is stored in its original format (structured, semi-structured, and unstructured)
without needing to be processed first.

2. Scalable:
o Designed to handle vast volumes of data, including real-time and batch data, with
horizontal scalability.

3. Flexible Schema:
o Uses a schema-on-read approach, meaning the schema is applied only when the
data is read or analyzed.

4. Cost-Effective:
o Typically built on inexpensive storage systems, such as Hadoop Distributed File
System (HDFS) or cloud storage (e.g., AWS S3, Azure Blob Storage).

5. Supports Diverse Data Types:

o Handles structured data (e.g., SQL tables), semi-structured data (e.g., JSON,
XML), and unstructured data (e.g., images, videos, logs).

6. Accessible for Big Data Analytics:

o Integrates well with big data processing frameworks like Apache Spark, Hive, or
Presto.
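
The schema-on-read idea can be illustrated with a short Python sketch. The raw records below are hypothetical and deliberately inconsistent; nothing is validated when they are stored, and a "purchase" schema is applied only at the moment a reader asks for purchases.

```python
import json

# Raw events stored exactly as they arrived (hypothetical records).
raw_lines = [
    '{"user": "u1", "action": "click", "ts": "2024-01-05T10:00:00"}',
    '{"user": "u2", "action": "purchase", "amount": "19.99"}',
    '{"device": "sensor-7", "temp_c": 21.5}',
]


def read_purchases(lines):
    """Apply a 'purchase' schema at read time; skip records that do not fit."""
    for line in lines:
        record = json.loads(line)
        if record.get("action") == "purchase" and "amount" in record:
            yield {"user": record["user"], "amount": float(record["amount"])}


purchases = list(read_purchases(raw_lines))
print(purchases)
```

A different analysis could read the same raw lines with a different schema, for example one that picks out sensor readings, without any change to what was stored.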

Data Lake vs. Data Warehouse:

Aspect        Data Lake                                          Data Warehouse
Data Format   Raw (structured, semi-structured, unstructured)    Processed and structured
Schema        Schema-on-read                                     Schema-on-write
Cost          Cost-effective (cheaper storage)                   Higher cost (optimized for performance)
Use Cases     Big data, machine learning, real-time analytics    Business intelligence, reporting
Performance   Requires additional processing for querying        Optimized for fast querying

How a Data Lake Works:

o Ingestion: Data is ingested from various sources, such as IoT devices, databases, social media, web logs, and more.

o Storage: Data is stored in its native format in a distributed storage system.

o Processing and Analytics: Big data tools (e.g., Spark, Hadoop) process the data for analysis, reporting, or machine learning.

o Access: Users access the data using BI tools, SQL queries, or machine learning frameworks.
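
The four stages can be mimicked on a local file system as a toy model. In this sketch a temporary directory stands in for distributed storage, and the folder layout, file names, and helper functions are assumptions made for illustration; a real lake would use something like HDFS or cloud object storage and a processing engine such as Spark.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def ingest(lake_root, source, name, payload):
    """Ingestion + storage: write the payload unchanged under a per-source folder."""
    target = Path(lake_root) / source
    target.mkdir(parents=True, exist_ok=True)
    (target / name).write_text(payload, encoding="utf-8")


def count_actions(lake_root):
    """Processing: parse the stored web-log files only when analysis runs."""
    counts = Counter()
    for path in sorted((Path(lake_root) / "web_logs").glob("*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            counts[json.loads(line)["action"]] += 1
    return counts


with tempfile.TemporaryDirectory() as lake:
    # Files of different formats land side by side, each in its native form.
    ingest(lake, "web_logs", "day1.jsonl", '{"action": "view"}\n{"action": "buy"}\n')
    ingest(lake, "web_logs", "day2.jsonl", '{"action": "view"}\n')
    ingest(lake, "iot", "sensor.csv", "device,temp_c\nsensor-7,21.5\n")
    # Access: the result would be handed to a report, dashboard, or model.
    action_counts = count_actions(lake)

print(dict(action_counts))
```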

Benefits of a Data Lake:

1. Flexibility:
o Can store and process any type of data.
2. Scalability:
o Can grow as needed to handle increasing data volumes.
3. Cost Savings:
o Cheaper storage options compared to data warehouses.
4. Support for Advanced Analytics:
o Enables machine learning, predictive analytics, and more.

Challenges of a Data Lake:

1. Data Governance:
o Without proper governance, it can turn into a "data swamp," making data hard to
find and use.
2. Performance:
o Querying raw data can be slower compared to a structured data warehouse.
3. Security:
o Requires strong access controls to protect sensitive data.

Key Characteristics of Data Repositories

1. Scalability: The ability to grow with increasing data volumes.

2. Performance: Optimized for fast querying and data retrieval.
3. Integration: Compatibility with various data sources and analytics tools.
4. Security: Ensures data confidentiality with encryption and access control.
5. Flexibility: Handles structured, semi-structured, and unstructured data types.

Advantages of Data Repositories

• Centralized Data Management: All data is stored in a single location, simplifying management.
• Data Accessibility: Facilitates easy access for analytics and reporting.
• Cost Efficiency: Reduces duplication of data storage systems.
• Enhanced Security: Ensures proper governance and access controls.
• Supports Advanced Analytics: Enables machine learning and AI applications.

Applications of Data Repositories

1. E-commerce:
o Use data warehouses to analyze customer purchase trends.
o Store product reviews in NoSQL databases.
2. Healthcare:
o Use data lakes to store medical imaging data and electronic health records.
o Leverage big data platforms to process genomic data.
3. Finance:
o Use big data platforms for fraud detection and risk assessment.
o Store historical market data in data warehouses for predictive modeling.
4. Social Media Analytics:
o Use public repositories for sentiment analysis of user-generated content.
o Store streaming data from social platforms in data lakes for real-time analytics.
5. Research and Education:
o Use open data repositories to study climate change trends.
o Train machine learning models using publicly available datasets.

Data repositories are indispensable in modern data analytics. They provide efficient storage, easy
access, and powerful integration with analytics tools. By leveraging different types of
repositories such as databases, data warehouses, and data lakes, organizations can gain valuable
insights, optimize operations, and drive innovation. The selection of the right data repository
depends on the nature of the data and the analytical requirements, ensuring the effective
management of data in today’s digital era.
