BDH_practical_08_29
Aim: Pig Operations: Load & Store Data, Aggregation Operations, Filtering Data and
Joining Datasets.
Theory:
Apache Pig offers a powerful scripting language, Pig Latin, to handle data transformation and
analysis on large datasets. Pig operations simplify the processing of structured and
unstructured data. Key operations include:
1. Loading and Storing Data: Pig allows data to be loaded from various sources such
as HDFS, local files, or other storage systems, and it can store the output back into
HDFS or other destinations.
2. Aggregation Operations: The GROUP operator collects records that share a key, and
built-in functions such as COUNT, SUM, and AVG then summarize each group.
3. Filtering Data: Pig supports filtering operations, allowing users to extract a subset of
data based on specific conditions or expressions.
4. Joining Datasets: Pig allows joining multiple datasets on a common field or key,
facilitating complex data analysis tasks.
These operations provide flexibility in managing big data pipelines, making Apache Pig a
versatile tool for big data analytics.
Implementation:
• Load Data into Pig:
data = LOAD 'hdfs://path/to/data.csv' USING PigStorage(',') AS (field1:datatype,
field2:datatype, ...);
• Store Data in HDFS:
STORE data INTO 'hdfs://path/to/output' USING PigStorage(',');
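The two templates above can be combined into a short round trip. The following is a minimal sketch only: the file paths, the relation name employee_data, and the four-column schema are assumptions made for illustration and do not come from the practical itself.

```pig
-- Assumed input: a comma-separated file with four columns (hypothetical path and schema).
employee_data = LOAD 'input/employees.csv' USING PigStorage(',')
    AS (emp_id:int, name:chararray, department:chararray, salary:double);

-- Print the schema Pig has attached to the relation.
DESCRIBE employee_data;

-- Write the relation back out as comma-separated text (hypothetical output directory).
STORE employee_data INTO 'output/employees_copy' USING PigStorage(',');
```

STORE writes a directory of part files rather than a single file, and the job fails if the output directory already exists, so the output path should be new for each run.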
Aggregation Operations:
• GROUP: Group records based on a common field for aggregation (here, employee_data is a
previously loaded relation that has a department field).
grouped_data = GROUP employee_data BY department;
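GROUP on its own only collects records into bags; the aggregate functions are applied in a following FOREACH. The sketch below shows one way the aggregation, filtering, and joining steps named in the Aim might look together. The paths, the departments relation, all field names, and the 50000 salary threshold are hypothetical and chosen only for illustration.

```pig
-- Hypothetical inputs: paths, relation names, and schemas are assumptions.
employee_data = LOAD 'input/employees.csv' USING PigStorage(',')
    AS (emp_id:int, name:chararray, department:chararray, salary:double);
departments = LOAD 'input/departments.csv' USING PigStorage(',')
    AS (department:chararray, location:chararray);

-- Aggregation: one output row per department with count, sum, and average.
grouped_data = GROUP employee_data BY department;
dept_stats = FOREACH grouped_data GENERATE
    group AS department,
    COUNT(employee_data) AS headcount,
    SUM(employee_data.salary) AS total_salary,
    AVG(employee_data.salary) AS avg_salary;

-- Filtering: keep only records that satisfy a condition (threshold is arbitrary).
high_paid = FILTER employee_data BY salary > 50000.0;

-- Joining: inner join of two relations on the shared department field.
joined = JOIN high_paid BY department, departments BY department;

-- After a join, fields are disambiguated with the relation::field syntax.
result = FOREACH joined GENERATE
    high_paid::name AS name,
    high_paid::department AS department,
    departments::location AS location;

STORE dept_stats INTO 'output/dept_stats' USING PigStorage(',');
STORE result INTO 'output/high_paid_with_location' USING PigStorage(',');
```

Inside the FOREACH, group refers to the grouping key and employee_data refers to the bag of records in that group, which is why the aggregates are written as SUM(employee_data.salary) rather than SUM(salary).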
Conclusion:
The operations provided by Apache Pig, such as loading and storing data, performing
aggregation, filtering datasets, and joining multiple datasets, make it an effective tool
for large-scale data processing in a distributed Hadoop environment. Because Pig Latin
expresses these manipulations in a few lines of script rather than hand-written MapReduce
code, developers and analysts can build and modify data pipelines quickly and get results
from large datasets with less effort.
Submitted to: Mrs. Kiran Khandare Ma’am
Name: Aman Raut Roll No.: 29 Reg. No.: 21071360