
Unit-4: Data Management using Cloud Computing (12 Marks)

1. Architecture of Modern Data Pipeline:-


i. A data pipeline is a method to accept raw data from various
sources, process this data to convert it into meaningful
information, and then push it into storage such as a data lake or data
warehouse.

 TYPES OF DATA PIPELINES:-


Different types of data pipelines are designed to serve specific data
processing and analysis requirements. Here are some common types of
data pipelines:
1. Batch data pipeline:
Batch data pipelines process data in large batches or sets at specific
intervals. They are suitable for scenarios where near real-time data
processing is unnecessary and periodic updates or data refreshes are
sufficient.
Batch pipelines typically involve extracting data from various sources,
performing transformations, and loading the processed data into a target
destination (see the sketch after this list).
2. Real-time data pipeline:
Real-time data pipelines handle data in a continuous and streaming
manner, processing data as it arrives or is generated. They are ideal
for scenarios that require immediate access to the most up-to-date
data and where real-time analytics or actions are necessary.
Real-time pipelines often involve ingesting data streams, performing
transformations or enrichments on the fly, and delivering the
processed data to downstream systems or applications in real time.
3. Event-driven data pipeline:
Event-driven pipelines are triggered by specific events or actions,
such as data updates, system events, or user interactions. They
respond to these events and initiate data processing tasks
accordingly.
Event-driven pipelines are commonly used in scenarios where data
processing needs to be triggered by specific events rather than on a
predefined schedule.
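
The following is a minimal, illustrative sketch of a batch pipeline using only the Python standard library. The file name, database name, and table schema (orders.csv, warehouse.db, an orders table with order_id, amount, and status) are assumptions made for the example, not part of the material above.

import csv
import sqlite3

def run_batch_pipeline(source_csv="orders.csv", warehouse_db="warehouse.db"):
    # Extract: read the raw batch from a flat-file source.
    with open(source_csv, newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: keep only completed orders and normalise the amount field.
    cleaned = [
        {"order_id": r["order_id"], "amount": round(float(r["amount"]), 2)}
        for r in rows
        if r.get("status") == "completed"
    ]

    # Load: append the processed batch into the target table.
    with sqlite3.connect(warehouse_db) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)"
        )
        conn.executemany(
            "INSERT INTO orders VALUES (:order_id, :amount)", cleaned
        )

# A scheduler (cron, Airflow, or similar) would call run_batch_pipeline()
# at fixed intervals; a real-time pipeline would instead consume events
# continuously as they arrive.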
 DATA PIPELINE ARCHITECTURE:-
A Data Pipeline Architecture is a blueprint or framework for moving data
from various sources to a destination. It involves a sequence of steps or
stages that process data, starting with collecting raw data from multiple
sources and then transforming and preparing it for storage and analysis.
The architecture includes components for data ingestion, transformation,
storage, and delivery. The pipeline might also have various tools and
technologies, such as data integration platforms, data warehouses, and data
lakes, for storing and processing the data.
Data pipeline architectures are crucial for efficient data management,
processing, and analysis in modern businesses and organizations.
We break down data pipeline architecture into a series of parts and
processes, including:
i. DATA INGESTION:-
Sources:
Data sources refer to any place or application from which data is collected
for analysis, processing, or storage. Examples of data sources include
databases, data warehouses, cloud storage systems, files on local drives,
APIs, social media platforms, and sensor data from IoT devices.
Data can be structured, semi-structured, or unstructured, depending on the
source. The selection of the source fully depends on the intended use and the
requirements of the data pipeline or analytics application.
Joins
The data flows in from multiple sources. Joins are the logic implemented
to define how the data is combined. When performing joins between
different data sources, the process can be more complex than traditional
database joins due to differences in data structure, format, and storage.
Extraction
Data extraction is the process of extracting or retrieving specific data
from a larger dataset or source. This can involve parsing through
unstructured data to find relevant information or querying a database to
retrieve specific records or information.
Data extraction is an important part of data analysis, as it allows analysts
to focus on specific subsets of data and extract insights and findings from
that data.
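
As a rough illustration of joins and extraction during ingestion, the short Python sketch below combines records from two hypothetical sources (a CRM export and an orders feed) on a shared key and then extracts only the fields needed downstream; all names and values are invented for the example.

# Two hypothetical sources sharing the key customer_id.
crm_records = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
order_records = [
    {"order_id": 10, "customer_id": 1, "amount": 250.0},
    {"order_id": 11, "customer_id": 2, "amount": 80.0},
]

# Join: index one source by the join key, then enrich the other source.
crm_by_id = {r["customer_id"]: r for r in crm_records}
joined = [
    {**order, "customer_name": crm_by_id[order["customer_id"]]["name"]}
    for order in order_records
    if order["customer_id"] in crm_by_id
]

# Extraction: pull only the fields the downstream analysis needs.
extracted = [{"order_id": r["order_id"], "amount": r["amount"]} for r in joined]
print(extracted)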
ii. DATA TRANSFORMATION:-
Standardization
Data standardization, also known as data normalization, is the process
of transforming and organizing data into a consistent format that
adheres to predefined standards.
It involves applying a set of rules or procedures to ensure that data from
different sources or systems are structured and formatted uniformly,
making it easier to compare, analyze, and integrate.
Data standardization typically involves the following steps:
 Data cleansing
 Data formatting
 Data categorization
 Data validation
 Data integration
Correction
Data correction, also known as data cleansing or data scrubbing, refers to
the process of identifying and rectifying errors, inconsistencies,
inaccuracies, or discrepancies within a dataset.
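
The sketch below illustrates standardization and correction on a single hypothetical record; the field names, the source date format, and the validation rule are assumptions made for the example only.

from datetime import datetime

def standardize(record):
    """Clean, format, and validate one raw record; return None if invalid."""
    # Data cleansing: trim stray whitespace from string fields.
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    # Data formatting: force one date format and upper-case country codes.
    cleaned["signup_date"] = datetime.strptime(
        cleaned["signup_date"], "%d/%m/%Y").strftime("%Y-%m-%d")
    cleaned["country"] = cleaned["country"].upper()
    # Data validation / correction: reject records with an impossible age.
    if not 0 < int(cleaned["age"]) < 120:
        return None
    return cleaned

print(standardize({"signup_date": "05/08/2024", "country": " in ", "age": "34"}))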
iii. DATA STORAGE:-
Load
In data engineering, data loading refers to the process of ingesting or
importing data from various sources into a target destination, such as a
database, data warehouse, or data lake.
It involves moving the data from its source format to a storage or
processing environment where it can be accessed, managed, and analyzed
effectively.
Automation
Data pipeline automation refers to the practice of automating the process
of creating, managing, and executing data pipelines.
A data pipeline is a series of interconnected steps that involve extracting,
transforming, and loading (ETL) data from various sources to a target
destination for analysis, reporting, or other purposes.
Automating this process helps streamline data workflows, improve
efficiency, and reduce manual intervention.
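
A very small automation sketch is given below: the three ETL steps are chained, wrapped with a simple retry, and exposed as one callable that a scheduler (cron, Airflow, or similar) could invoke at each batch interval. The step bodies and the retry count are placeholders, not a recommended design.

import time

def extract():
    return [{"id": 1, "value": " 42 "}]          # stand-in for a real source

def transform(rows):
    return [{"id": r["id"], "value": int(r["value"].strip())} for r in rows]

def load(rows):
    print(f"loaded {len(rows)} rows")            # stand-in for a warehouse write

def run_once(max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            load(transform(extract()))
            return
        except Exception as exc:                 # retry transient failures
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(2 ** attempt)

if __name__ == "__main__":
    run_once()   # a scheduler would invoke this at every batch interval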
2. Data Pipeline Characteristics:-
Only robust end-to-end data pipelines can properly equip you to source,
collect, manage, analyze, and effectively use data so you can generate new
market opportunities and deliver cost-saving business processes. Modern
data pipelines make extracting information from the data you collect fast
and efficient.
Characteristics to look for when considering a data pipeline include:
1. Continuous and extensible data processing
2. The elasticity and agility of the cloud
3. Isolated and independent resources for data processing
4. Democratized data access and self-service management
5. High availability and disaster recovery

3. Collecting and Ingesting Data:-

Data Collection: Definition: Data collection is the process of gathering raw data
from various sources and compiling it into a central location for analysis. It is
typically the first step in the data analysis process.

Diagram:

+--------------+
| Data Sources |
+--------------+
       |
       v
+--------------+
| Data Storage |
+--------------+

Data Ingestion: Definition: Data ingestion is the process of taking data from
various sources and preparing it for analysis. This can involve transforming the
data, cleaning it, and structuring it so that it can be easily analyzed.
Diagram:

+----------------+
|  Data Sources  |
+----------------+
        |
        v
+----------------+
| Data Ingestion |
+----------------+
        |
        v
+----------------+
|  Data Storage  |
+----------------+

Key Differences (see the sketch after this list):
1. Data collection involves gathering raw data from various sources,
while data ingestion involves processing and preparing data for
analysis.
2. Data collection is typically a one-time process, while data ingestion can
be an ongoing process.
3. Data collection can involve manual entry of data, while data ingestion
is typically an automated process.
4. Data collection can be a time-consuming and resource-intensive process,
while data ingestion can be faster and more efficient.
5. Data collection is often done in a decentralized manner, while data
ingestion is typically centralized.
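
To make the distinction concrete, the sketch below contrasts a one-off collection step with a continuous, automated ingestion step; the record shape and the cleaning rule are invented for the example.

import json

def collect_once(paths):
    """Collection: gather raw records from files into one batch."""
    batch = []
    for p in paths:
        with open(p) as f:
            batch.append(json.load(f))
    return batch

def ingest_stream(stream, sink):
    """Ingestion: continuously clean each arriving record and store it."""
    for record in stream:            # stream could be a queue, topic, or socket
        record["name"] = record["name"].strip().title()
        sink.append(record)

sink = []
ingest_stream(iter([{"name": " asha "}, {"name": "RAVI"}]), sink)
print(sink)                          # [{'name': 'Asha'}, {'name': 'Ravi'}]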

4. Transforming Data:-
Data transformation is the process of converting and cleaning raw data
from one data source to meet the requirements of its new location. Also
called data wrangling, transforming data is essential to ingestion
workflows that feed data warehouses and modern data lakes. Analytics
projects may also use data transformation to prepare warehoused data for
analysis.
Why is data transformation important?
Data transformation remains indispensable because enterprise data
ecosystems remain stubbornly heterogeneous despite decades of
centralization and standardization initiatives. Each application and storage
system takes a slightly different approach to formatting and structuring
data. Variations in format, structure, and quality occur as
business domains and regional operations develop their own data systems.
Without data transformation, data analysts would have to fix these
inconsistencies each time they tried to combine two data sources. This
project-by-project approach consumes resources, risks variations
between analyses, and makes decision-making less effective.
The process of transforming data from multiple sources to meet a single
standard improves the efficiency of a company’s data analysis operations
by delivering the following benefits (a short sketch follows the list):
i. Data quality improvement
ii. Data Consistency
iii. Data Integration
iv. Data Analysis
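
As a short illustration of these benefits, the sketch below maps two hypothetical sources with different field names and date formats onto one standard schema before they are combined; all names and formats are assumptions for the example.

from datetime import datetime

source_a = [{"CustID": "1", "SignupDate": "2024-08-05"}]
source_b = [{"customer_id": 2, "signup": "09/08/2024"}]

def from_a(r):
    # Source A already uses ISO dates; only the field names change.
    return {"customer_id": int(r["CustID"]), "signup_date": r["SignupDate"]}

def from_b(r):
    # Source B uses day/month/year, so the date is reformatted as well.
    return {"customer_id": r["customer_id"],
            "signup_date": datetime.strptime(r["signup"], "%d/%m/%Y")
                                   .strftime("%Y-%m-%d")}

unified = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
print(unified)   # both rows now share one schema and one date format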

5. Designing a Pipeline:-

Step 1: Determine the goal in building data pipelines


Your first step when building data pipelines is to identify the outcome or value it
will offer your company or product. At this point, you’d ask questions like:
i. What are our objectives for this data pipeline?
ii. How do we measure the success of the data pipeline?
iii. What use cases will the data pipeline serve (reporting, analytics, machine
learning)?
iv. Who are the end-users of the data that this pipeline will produce? How will
that data help them meet their goals?
Step 2: Choose the data sources
In the next step, consider the possible data sources that will feed the data pipeline. Ask
questions such as:
i. What are all the potential sources of data?
ii. In what format will the data come in (flat files, JSON, XML)?
iii. How will we connect to the data sources?
Step 3: Determine the data ingestion strategy
Now that you understand your pipeline goals and have defined data sources, it’s
time to ask questions about how the pipeline will collect the data. Ask questions
including:
i. Should we build our own data ingestion pipelines in-house with Python, Airflow,
and other scripting tools?
ii. Would we be utilizing third-party integration tools to ingest the data?
iii. Are we going to be using intermediate data stores to store data as it flows to
the destination?
iv. Are we collecting data from the origin in predefined batches or in real time?
Step 4: Design the data processing plan
Once data is ingested, it must be processed and transformed for it to be valuable to
downstream systems. At this stage, you’ll ask questions like:
i. What data processing strategies are we utilizing on the data (ETL, ELT,
cleaning, formatting)?
ii. Are we going to be enriching the data with specific attributes?
iii. Are we using all the data or just a subset?
iv. How do we remove redundant data?
Step 5: Set up storage for the output of the pipeline
Once the data gets processed, we must determine the final storage destination for our
data to serve various business use cases. Ask questions including:
i. Are we going to be using big data stores like data warehouses or data lakes?
ii. Would the data be stored in the cloud or on-premises?
iii. Which of the data stores will serve our top use cases?
iv. In what format will the final data be stored?
Step 6: Plan the data workflow
Now, it’s time to design the sequencing of processes in the data pipeline. At this
stage, we ask questions such as (a minimal workflow sketch follows this list):
i. What downstream jobs are dependent on the completion of an upstream job?
ii. Are there jobs that can run in parallel?
iii. How do we handle failed jobs?
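
Since Step 3 mentions Airflow as one option, here is a hedged sketch of how this workflow plan might be expressed as an Airflow-style DAG. The DAG name, schedule, retry policy, and task bodies are all illustrative assumptions rather than a prescribed design.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():   print("extract from sources")
def transform(): print("clean and transform")
def load():      print("load into the warehouse")
def report():    print("refresh the reporting tables")

with DAG(
    dag_id="daily_sales_pipeline",        # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2},          # how failed jobs are retried
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_report = PythonOperator(task_id="report", python_callable=report)

    # Downstream jobs wait for their upstream job; load and report both
    # depend on transform and can run in parallel with each other.
    t_extract >> t_transform >> [t_load, t_report]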
Step 7: Implement a data monitoring and governance framework
You’ve almost built an entire data pipeline! Our second-to-last step is
establishing a data monitoring and governance framework, which helps us observe
the data pipeline to ensure a healthy and efficient channel that’s reliable, secure, and
performs as required. In this step, we determine:
i. What needs to be monitored? Dropped records? Failed pipeline runs? Node
outages?
ii. How do we ensure data is secure and no sensitive data is exposed?
iii. How do we secure the machines running the data pipelines?
iv. Is the data pipeline meeting the delivery SLOs?
v. Who is in charge of data monitoring?
Step 8: Plan the data consumption layer
This final step determines the various services that’ll consume the processed data
from our data pipeline. At the data consumption layer, we ask questions such as:
i. What’s the best way to harness and utilize our data?
ii. Do we have all the data we need for our intended use case?
iii. How do our consumption tools connect to our data stores?

6. Evolving from ETL to ELT:-


What’s ETL
To simplify, ETL or Extract Transform Load is a data integration process that
involves extracting data from various sources, transforming it into a suitable
format (arranging it), and loading it into a target data warehouse or data hub. As
the name suggests, it involves:
Extract:
This phase involves retrieving data from disparate sources such as databases, flat
files, or APIs.
Transform:
Data is cleaned, standardized, aggregated, and manipulated to meet business
requirements. This includes data cleansing, formatting, calculations, and data
enrichment.
Load:
The transformed data is transferred into the target system, often a data warehouse,
for analysis and reporting.
ETL processes are critical for building data warehouses and enabling business
intelligence and advanced analytics capabilities.

Defining ELT
ELT is a data integration process where raw data is extracted from various sources
and loaded into a data lake or data warehouse without immediate transformation
(that’s done later). The data is transformed only when needed for specific analysis
or reporting. As the name suggests, it involves:
Extract:
Data is pulled from disparate sources.
Load:
Raw data is stored in a data lake or data warehouse in its original format.
Transform:
Data is transformed and processed as needed for specific queries or reports. This
approach uses cloud computing and big data technologies to handle large volumes
of data efficiently and at the right time.
ELT is often associated with cloud-based data warehousing and big data analytics
platforms.
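
The difference can be sketched in a few lines of Python, using SQLite purely as a stand-in for a warehouse; the table names and the temperature data are invented for the example.

import sqlite3

raw = [{"city": "  Pune ", "temp_f": 95.0}, {"city": "Nagpur", "temp_f": 102.2}]

conn = sqlite3.connect(":memory:")

# ETL: transform in the pipeline code first, then load the finished rows.
etl_rows = [(r["city"].strip(), round((r["temp_f"] - 32) * 5 / 9, 1)) for r in raw]
conn.execute("CREATE TABLE weather_etl (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO weather_etl VALUES (?, ?)", etl_rows)

# ELT: load the raw rows as-is, transform later inside the warehouse with SQL.
conn.execute("CREATE TABLE weather_raw (city TEXT, temp_f REAL)")
conn.executemany("INSERT INTO weather_raw VALUES (?, ?)",
                 [(r["city"], r["temp_f"]) for r in raw])
conn.execute("""
    CREATE TABLE weather_elt AS
    SELECT TRIM(city) AS city, ROUND((temp_f - 32) * 5 / 9, 1) AS temp_c
    FROM weather_raw
""")

print(conn.execute("SELECT * FROM weather_elt").fetchall())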

The Shift from ETL to ELT: Evolving Data Integration:-


The shift from ETL to ELT represents more than just a change in process—it’s a
fundamental shift in how businesses handle their data. Data analytics
companies understand that the future is digital, and staying a step ahead requires
not just adapting to new technologies, but leading the way. Our mission is to help
businesses like yours use the power of data, ensuring that every data point
contributes to your business sustainability.
For decades, ETL has been the standard approach to data integration. As explained above,
the process involves extracting data from various sources, transforming it into a
suitable format, and then loading it into a data warehouse or other system for
analysis. While ETL has served us well, it comes with significant limitations.
 Data Latency: Traditional ETL processes often result in delays, meaning that by
the time data is ready for analysis, it may already be outdated.
 Complexity: ETL can be complex and time-consuming, requiring substantial
resources to manage the entire data transformation process.
 Cost: The infrastructure needed to support ETL can be expensive, particularly as
data volumes grow, which also limits scalability.
ELT flips the traditional model on its head. Instead of transforming data before
it’s loaded, ELT first loads the raw data into a data warehouse or data lake and
then performs transformations as needed. This shift offers many advantages:
 Better Agility: By loading data first, businesses can start working with their data
much sooner, allowing for more agile and responsive decision-making.
 Scalability: ELT is better suited for the massive datasets that are becoming the
norm today. It scales more easily and efficiently than traditional ETL processes.
 Cost-Efficiency: With ELT, businesses can utilize cloud-based data storage and
processing solutions, reducing the need for expensive on-premise infrastructure.

7. Delivering and Sharing Data:-


Data delivery in a modern data pipeline is the process of moving processed and
analyzed data to a target system or application. This can include a database, data
warehouse, reporting tool, or dashboard.

Here are some things to consider about data delivery and sharing in a modern data
pipeline (a short delivery sketch follows these points):
 DataOps
Modern data pipelines are developed using the DataOps methodology, which
combines technologies and processes to automate data pipelines and shorten
development and delivery cycles.
 Data sharing
Amazon Redshift Data Sharing allows data producers to define permissions and grant
access to consumers. Consumers can access live copies of the data, which is still
owned by the producer.
 Data sources
Data pipelines can ingest, process, prepare, transform, and enrich structured,
unstructured, and semi-structured data.
 Data storage
Data pipelines can store data in a data lake or data warehouse. Data lakes are better
for organizations that need a large repository for raw data, while data warehouses
are better for organizations that need quick access to structured data.
 Data tools
Modern data pipeline tools combine drag-and-drop canvas builders with inline
code and query editors. These tools support open formats like Python or SQL for
transformation logic, and YAML for describing the pipeline topology.
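
As a closing illustration, the sketch below shows one simple delivery step: processed rows are read from the warehouse (SQLite standing in again) and written out as a CSV that a dashboard or reporting tool could consume. The database path, table, and output file name are assumptions carried over from the earlier load sketch.

import csv
import sqlite3

def deliver_report(db_path="warehouse.db", out_path="daily_sales_report.csv"):
    # Read the processed rows from the warehouse table.
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT order_id, amount FROM orders ORDER BY order_id"
        ).fetchall()

    # Export them in a format a reporting tool or dashboard can pick up.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])   # header for the consumer
        writer.writerows(rows)

# deliver_report() would typically run as the last task of the pipeline,
# after the load step has finished.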
