Unit-4
Data Collection: Definition: Data collection is the process of gathering raw data
from various sources and compiling it into a central location for analysis. It is
typically the first step in the data analysis process.
Diagram:
+----------------+
|  Data Sources  |
+----------------+
        |
        v
+----------------+
|  Data Storage  |
+----------------+
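For illustration only, here is a minimal Python sketch of data collection: raw records are gathered from two hypothetical sources (a CSV file and an in-memory list standing in for an API response) and compiled into one central list. The file name "sales.csv" and the field names are assumptions made for the example.

import csv

def collect_from_csv(path):
    # Gather raw rows from a CSV file (one hypothetical source).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def collect_from_api():
    # Stand-in for an API call; a real collector might use requests.get(...).
    return [{"id": "101", "amount": "250"}, {"id": "102", "amount": "90"}]

# Compile raw data from all sources into one central location for later analysis.
central_store = []
central_store.extend(collect_from_csv("sales.csv"))  # assumed file name
central_store.extend(collect_from_api())
print(len(central_store), "raw records collected")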
Data Ingestion: Definition: Data ingestion is the process of taking data from
various sources and preparing it for analysis. This can involve transforming the
data, cleaning it, and structuring it so that it can be easily analyzed.
Diagram:
+----------------+
|  Data Sources  |
+----------------+
        |
        v
+----------------+
| Data Ingestion |
+----------------+
        |
        v
+----------------+
|  Data Storage  |
+----------------+
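A minimal Python sketch of the ingestion step, assuming raw records like those collected above: the data is cleaned (incomplete rows dropped), structured (text converted to proper types), and then loaded into storage. The field names and the local SQLite table are illustrative assumptions, not a prescribed design.

import sqlite3

raw_records = [
    {"id": "101", "amount": "250"},
    {"id": "102", "amount": ""},  # dirty record that cleaning will drop
]

# Clean and structure: drop incomplete rows, convert text to proper types.
cleaned = [
    (int(r["id"]), float(r["amount"]))
    for r in raw_records
    if r["amount"].strip()
]

# Load the structured records into storage (a local SQLite table here).
conn = sqlite3.connect("storage.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
conn.commit()
conn.close()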
Key Differences:
1. Data collection involves gathering raw data from various sources,
while data ingestion involves processing and preparing data for
analysis.
2. Data collection is typically a one-time process, while data ingestion can
be an ongoing process.
3. Data collection can involve manual entry of data, while data ingestion
is typically an automated process.
4. Data collection can be a time-consuming and resource-intensive process,
while data ingestion can be faster and more efficient.
5. Data collection is often done in a decentralized manner, while data
ingestion is typically centralized.
4. Transforming Data:-
Data transformation is the process of converting and cleaning raw data
from one data source to meet the requirements of its new location. Also
called data wrangling, transforming data is essential to ingestion
workflows that feed data warehouses and modern data lakes. Analytics
projects may also use data transformation to prepare warehoused data for
analysis.
Why is data transformation important?
Data transformation remains indispensable because enterprise data
ecosystems remain stubbornly heterogeneous despite decades of
centralization and standardization initiatives. Each application and storage
system takes a slightly different approach to formatting and structuring
data. Variations in format, structure, and quality arise as business domains
and regional operations develop their own data systems.
Without data transformation, data analysts would have to fix these
inconsistencies each time they tried to combine two data sources. This
project-by-project approach consumes resources, risks variations
between analyses, and makes decision-making less effective.
The process of transforming data from multiple sources to meet a single
standard improves the efficiency of a company’s data analysis operations
by delivering the following benefits:
i. Data quality improvement
ii. Data consistency
iii. Data integration
iv. Data analysis
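As a rough illustration of transformation (data wrangling), the Python sketch below converts records from two sources with different conventions to one standard shape: consistent column names, one date format, and one currency unit. The source formats, field names, and the conversion rate are assumptions made for the example.

from datetime import datetime

# Two sources format the "same" data differently (assumed example records).
source_a = [{"OrderDate": "03/15/2024", "Total": "1,200.50"}]
source_b = [{"order_date": "2024-03-16", "total_eur": 900.0}]

EUR_TO_USD = 1.08  # assumed conversion rate, for illustration only

def transform_a(row):
    # Reformat US-style dates and strip thousands separators.
    return {
        "order_date": datetime.strptime(row["OrderDate"], "%m/%d/%Y").date().isoformat(),
        "total_usd": float(row["Total"].replace(",", "")),
    }

def transform_b(row):
    # Already ISO dates; convert the currency to the standard unit.
    return {
        "order_date": row["order_date"],
        "total_usd": round(row["total_eur"] * EUR_TO_USD, 2),
    }

# One standard shape, regardless of each source's original format.
standardized = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
print(standardized)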
5. Designing Pipeline:-
Defining ELT
ELT is a data integration process where raw data is extracted from various sources
and loaded into a data lake or data warehouse without immediate transformation
(that’s done later). The data is transformed only when needed for specific analysis
or reporting. As the name suggests, it involves:
Extract:
Data is pulled from disparate sources.
Load:
Raw data is stored in a data lake or data warehouse in its original format.
Transform:
Data is transformed and processed as needed for specific queries or reports.
This approach relies on cloud computing and big data technologies to handle
large volumes of data efficiently, transforming it only when it is needed.
ELT is often associated with cloud-based data warehousing and big data analytics
platforms.
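A minimal Python sketch of the ELT order of operations, using a local SQLite database as a stand-in for a cloud data warehouse: extract pulls raw rows, load stores them unchanged, and transform runs later as a query when a report needs it. Table and column names are assumptions for the example.

import sqlite3

def extract():
    # Extract: pull raw rows from a source system (hard-coded here).
    return [("2024-03-15", "101", "250.00"), ("2024-03-15", "102", "90.00")]

def load(conn, rows):
    # Load: store the data in its original raw (text) form, no transformation yet.
    conn.execute("CREATE TABLE IF NOT EXISTS raw_sales (day TEXT, id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)

def transform(conn):
    # Transform: only when a report needs it, cast and aggregate inside the warehouse.
    return conn.execute(
        "SELECT day, SUM(CAST(amount AS REAL)) FROM raw_sales GROUP BY day"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load(conn, extract())
print(transform(conn))  # e.g. [('2024-03-15', 340.0)]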
Here are some things to consider about data delivery and sharing in a modern data
pipeline:
DataOps
Modern data pipelines are developed using the DataOps methodology, which
combines technologies and processes to automate data pipelines and shorten
development and delivery cycles.
Data sharing
Amazon Redshift Data Sharing allows data producers to define permissions and grant
access to consumers. Consumers can access live copies of the data, which is still
owned by the producer.
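As a hedged illustration only: the producer side of Amazon Redshift Data Sharing is typically set up with SQL statements like the ones below, here submitted through the boto3 Redshift Data API from Python. The cluster name, database, user, datashare name, and consumer namespace are placeholders, and the exact syntax should be checked against the Redshift documentation.

import boto3

client = boto3.client("redshift-data", region_name="us-east-1")  # assumed region

# Producer defines the datashare and grants access to a consumer namespace.
statements = [
    "CREATE DATASHARE sales_share;",
    "ALTER DATASHARE sales_share ADD SCHEMA public;",
    "ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-id>';",
]

for sql in statements:
    client.execute_statement(
        ClusterIdentifier="producer-cluster",  # placeholder cluster name
        Database="dev",                        # placeholder database
        DbUser="awsuser",                      # placeholder user
        Sql=sql,
    )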
Data sources
Data pipelines can ingest, process, prepare, transform, and enrich structured,
unstructured, and semi-structured data.
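For example, semi-structured JSON can be flattened into a tabular shape during ingestion. A small Python sketch follows, assuming the pandas library is available; the record shape is invented for illustration.

import pandas as pd

# Semi-structured records: nested fields and an optional key.
events = [
    {"id": 1, "amount": 120, "user": {"name": "asha", "country": "IN"}},
    {"id": 2, "amount": 80, "user": {"name": "li"}},
]

# json_normalize flattens nested dictionaries into dotted columns.
df = pd.json_normalize(events)
print(df.columns.tolist())                     # nested keys become 'user.name', 'user.country'
print(df.fillna({"user.country": "unknown"}))  # fill the missing optional field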
Data storage
Data pipelines can store data in a data lake or data warehouse. Data lakes are better
for organizations that need a large repository for raw data, while data warehouses
are better for organizations that need quick access to structured data.
Data tools
Modern data pipeline tools combine drag-and-drop canvas builders with inline
code and query editors. These tools support open formats like Python or SQL for
transformation logic, and YAML for describing the pipeline topology.
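A minimal sketch of what that separation can look like, assuming the PyYAML package is installed: the pipeline topology is described in YAML, while the transformation logic stays in Python or SQL. The step names and schema here are invented for illustration, not a specific tool's format.

import yaml  # PyYAML, assumed to be installed

# Pipeline topology described declaratively (invented schema, for illustration).
topology = yaml.safe_load("""
pipeline: daily_sales
steps:
  - name: extract_orders
    type: source
  - name: clean_orders
    type: transform
    sql: "SELECT id, CAST(amount AS REAL) AS amount FROM extract_orders"
  - name: load_warehouse
    type: sink
""")

# The transformation logic stays in code; the YAML only wires the steps together.
for step in topology["steps"]:
    print(f"{step['name']:15s} ({step['type']})")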