
BIG DATA ANALYTICS LAB

(A7902) (VCE-R21)

Week-9 Pig Latin commands

a) Implement Relational operators –Loading and Storing, and


Diagnostic operators -Dump, Describe, Illustrate & Explain
on the given database in Hadoop Pig framework using
Cloudera.
b) Develop a Pig Latin program to implement Filtering, Sorting
operations on the given database.

For the given Student dataset and Employee dataset, perform Relational operations like
Loading, Storing, Diagnostic Operations (Dump, Describe, Illustrate & Explain) in Hadoop
Pig framework using Cloudera
Student ID First Name Age City CGPA
001 Jagruthi 21 Hyderabad 9.1
002 Praneeth 22 Chennai 8.6
003 Sujith 22 Mumbai 7.8
004 Sreeja 21 Bengaluru 9.2
005 Mahesh 24 Hyderabad 8.8
006 Rohit 22 Chennai 7.8
007 Sindhu 23 Mumbai 8.3

Employee ID Name Age City


001 Angelina 22 LosAngeles
002 Jackie 23 Beijing
003 Deepika 22 Mumbai
004 Pawan 24 Hyderabad
005 Rajani 21 Chennai
006 Amitabh 22 Mumbai

Step-1: Create a directory in HDFS with the name pigdir in the required path using mkdir:
$ hdfs dfs -mkdir -p /bdalab/pigdir
Step-2: The input file of Pig contains each tuple/record on an individual line, with the fields separated by a delimiter (","). In the local file system, create the input files student_data.txt and employee_data.txt containing the data shown below.


student_data.txt:
001,Jagruthi,21,Hyderabad,9.1
002,Praneeth,22,Chennai,8.6
003,Sujith,22,Mumbai,7.8
004,Sreeja,21,Bengaluru,9.2
005,Mahesh,24,Hyderabad,8.8
006,Rohit,22,Chennai,7.8
007,Sindhu,23,Mumbai,8.3

employee_data.txt:
001,Angelina,22,LosAngeles
002,Jackie,23,Beijing
003,Deepika,22,Mumbai
004,Pawan,24,Hyderabad
005,Rajani,21,Chennai
006,Amitabh,22,Mumbai
Step-3: Move the files from the local file system to HDFS using the put (or copyFromLocal) command and verify using the -cat command.
$ hdfs dfs -put /home/cloudera/pigdir/student_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/student_data.txt
$ hdfs dfs -put /home/cloudera/pigdir/employee_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/employee_data.txt
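The copyFromLocal command mentioned in Step-3 behaves like put for local-to-HDFS copies; an equivalent form (same paths as above) would be:
$ hdfs dfs -copyFromLocal /home/cloudera/pigdir/student_data.txt /bdalab/pigdir/
$ hdfs dfs -copyFromLocal /home/cloudera/pigdir/employee_data.txt /bdalab/pigdir/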
Step-4: Apply Relational Operator – LOAD to load the data from the files student_data.txt and employee_data.txt into Pig by executing the following Pig Latin statements in the Grunt shell. Relational operators are NOT case sensitive.
$ pig => opens the grunt> shell
grunt> student = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, city:chararray, cgpa:float);
grunt> employee = LOAD '/bdalab/pigdir/employee_data.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, city:chararray);
Step-5: Apply Relational Operator – STORE to store the relations under the HDFS directory "/bdalab/pigdir/pig_output/" as shown below. Each relation is stored in its own sub-directory, because STORE fails if the target directory already exists.
grunt> STORE student INTO '/bdalab/pigdir/pig_output/student' USING PigStorage(',');
grunt> STORE employee INTO '/bdalab/pigdir/pig_output/employee' USING PigStorage(',');

Step-6: Verify the stored data as shown below.
$ hdfs dfs -ls /bdalab/pigdir/pig_output/
$ hdfs dfs -cat /bdalab/pigdir/pig_output/student/part-m-00000
$ hdfs dfs -cat /bdalab/pigdir/pig_output/employee/part-m-00000


Step-7: Apply Diagnostic Operator – DUMP to print the contents of a relation.
grunt> DUMP student;
grunt> DUMP employee;
Step-8: Apply Diagnostic Operator – DESCRIBE to view the schema of a relation.
grunt> DESCRIBE student;
grunt> DESCRIBE employee;
Step-9: Apply Diagnostic Operator – ILLUSTRATE to show a step-by-step sample execution of the sequence of statements that produce a relation.
grunt> ILLUSTRATE student;
grunt> ILLUSTRATE employee;
Step-10: Apply Diagnostic Operator – EXPLAIN to display the logical, physical, and MapReduce execution plans of a relation.
grunt> EXPLAIN student;
grunt> EXPLAIN employee;


Week-10 Pig Latin commands


a) Implement Grouping, Joining, Combining and Splitting
operations on the given database using Pig Latin statements.
b) Perform Eval Functions on the given dataset.
c) Develop a WordCount program using Pig Latin statements.
10.A) Implement Grouping, Joining, Combining and Splitting operations
on the given database using Pig Latin statements
The GROUP operator is used to group the data in one or more relations. It collects the data
having the same key.
grunt> Group_data = GROUP Relation_name BY Key;
Step-1: Group the records/tuples of the student relation by age and the employee relation by city using the GROUP operator, and verify.
grunt> group_std = GROUP student BY age;
grunt> Dump group_std;
grunt> group_emp = GROUP employee BY city;
grunt> Dump group_emp;
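For the student data shown earlier, the dump of group_std should contain one tuple per age value, along the lines of the following (tuple order may vary):
(21,{(1,Jagruthi,21,Hyderabad,9.1),(4,Sreeja,21,Bengaluru,9.2)})
(22,{(2,Praneeth,22,Chennai,8.6),(3,Sujith,22,Mumbai,7.8),(6,Rohit,22,Chennai,7.8)})
(23,{(7,Sindhu,23,Mumbai,8.3)})
(24,{(5,Mahesh,24,Hyderabad,8.8)})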
Step-2: View Schema of the table after grouping the data using the describe command
as shown below.
grunt> Describe group_std;
group_std: {group: int,student: {(id:int, name:chararray, age:int, city:chararray,
cgpa:float)}}
grunt> Describe group_emp;
group_emp: {group: chararray,employee: {(id: int,name: chararray,age: int,city: chararray)}}
Step-3: Group the relations by multiple columns (age and city) and verify the contents.
grunt> groupmultiple_std = GROUP student BY (age, city);
grunt> Dump groupmultiple_std;
grunt> groupmultiple_emp = GROUP employee BY (age, city);
grunt> Dump groupmultiple_emp;
Step-4: Group all the records of a relation into a single group using GROUP ALL and verify the content.
grunt> groupall_std = GROUP student All;
grunt> Dump groupall_std;


grunt> groupall_emp = GROUP employee All;


grunt> Dump groupall_emp;
Step-5: Group the records/tuples of the student and employee relations together on the key age using COGROUP, and then verify the result.
grunt> cogroup_stdemp = COGROUP student BY age, employee BY age;
grunt> Dump cogroup_stdemp;
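For the datasets shown earlier, the dump of cogroup_stdemp should contain one tuple per age value with one bag from each relation, along the lines of the following (order may vary):
(21,{(1,Jagruthi,21,Hyderabad,9.1),(4,Sreeja,21,Bengaluru,9.2)},{(5,Rajani,21,Chennai)})
(22,{(2,Praneeth,22,Chennai,8.6),(3,Sujith,22,Mumbai,7.8),(6,Rohit,22,Chennai,7.8)},{(1,Angelina,22,LosAngeles),(3,Deepika,22,Mumbai),(6,Amitabh,22,Mumbai)})
(23,{(7,Sindhu,23,Mumbai,8.3)},{(2,Jackie,23,Beijing)})
(24,{(5,Mahesh,24,Hyderabad,8.8)},{(4,Pawan,24,Hyderabad)})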
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one field (or a group of fields) from each relation as keys. When these keys match, the two tuples are matched; otherwise the records are dropped.
Joins can be of the following types −

• SELF-Join

• INNER-Join

• OUTER-Join − LEFT Join, RIGHT Join, and FULL Join


Step-6: SELF-JOIN – load the same data multiple times under different aliases (names) and join the relation with itself.
grunt> std1 = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',') AS
(id:int, name:chararray, age:int, city:chararray, cgpa:float);
grunt> std2 = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',') AS
(id:int, name:chararray, age:int, city:chararray, cgpa:float);
grunt> selfjoin_std_data = JOIN std1 BY id, std2 BY id;
grunt> dump selfjoin_std_data;
(1,Jagruthi,21,Hyderabad,9.1,1,Jagruthi,21,Hyderabad,9.1)
(2,Praneeth,22,Chennai,8.6,2,Praneeth,22,Chennai,8.6)
(3,Sujith,22,Mumbai,7.8,3,Sujith,22,Mumbai,7.8)
(4,Sreeja,21,Bengaluru,9.2,4,Sreeja,21,Bengaluru,9.2)
(5,Mahesh,24,Hyderabad,8.8,5,Mahesh,24,Hyderabad,8.8)
(6,Rohit,22,Chennai,7.8,6,Rohit,22,Chennai,7.8)
(7,Sindhu,23,Mumbai,8.3,7,Sindhu,23,Mumbai,8.3)
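Steps 7-10 below join a student relation named std_data with a student attendance relation named std_att, neither of which is loaded anywhere in this document. A minimal sketch of how they might be created, assuming a comma-separated attendance file student_att.txt in the same HDFS directory with the hypothetical fields id, name, status and intime:
grunt> std_data = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',') AS
(id:int, name:chararray, age:int, city:chararray, cgpa:float);
grunt> std_att = LOAD '/bdalab/pigdir/student_att.txt' USING PigStorage(',') AS
(id:int, name:chararray, status:chararray, intime:chararray);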
Step-7: INNER JOIN - EQUI JOIN creates a new relation by combining column values
of two relations based upon the join-predicate. It returns rows when there is a
match in both tables.
grunt> innerjoin_data_att = JOIN std_data BY id, std_att BY id;
grunt> dump innerjoin_data_att;
(1,Jagruthi,21,Hyderabad,9.1,1,Jagruthi,joined,9:10:10)
(4,Sreeja,21,Bengaluru,9.2,4,Sreeja,joined,9:10:24)
(6,Rohit,22,Chennai,7.8,6,Rohit,joined,9:11:15)
(7,Sindhu,23,Mumbai,8.3,7,Sindhu,joined,9:12:25)

A. Bhanu Prasad, Associate Professor of CSE, VCE


BIG DATA ANALYTICS LAB
(A7902) (VCE-R21)

OUTER JOIN returns all the rows from at least one of the relations. An outer join operation
is carried out in three ways −
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
Step-8: LEFT OUTER JOIN operation returns all rows from the left table, even if there are
no matches in the right relation.
Note: Student_data is LEFT
grunt> outerleft_data_att = JOIN std_data BY id LEFT, std_att BY id;
grunt> DUMP outerleft_data_att;
(1,Jagruthi,21,Hyderabad,9.1,1,Jagruthi,joined,9:10:10)
(2,Praneeth,22,Chennai,8.6,,,,)
(3,Sujith,22,Mumbai,7.8,,,,)
(4,Sreeja,21,Bengaluru,9.2,4,Sreeja,joined,9:10:24)
(5,Mahesh,24,Hyderabad,8.8,,,,)
(6,Rohit,22,Chennai,7.8,6,Rohit,joined,9:11:15)
(7,Sindhu,23,Mumbai,8.3,7,Sindhu,joined,9:12:25)
Note: Student_att is LEFT
grunt> outerleft_att_data = JOIN std_att BY id LEFT, std_data BY id;
grunt> DUMP outerleft_att_data;
(1,Jagruthi,joined,9:10:10,1,Jagruthi,21,Hyderabad,9.1)
(4,Sreeja,joined,9:10:24,4,Sreeja,21,Bengaluru,9.2)
(6,Rohit,joined,9:11:15,6,Rohit,22,Chennai,7.8)
(7,Sindhu,joined,9:12:25,7,Sindhu,23,Mumbai,8.3)
(8,Sai,joined,9:14:18,,,,,)
(9,Meghana,joined,9:15:25,,,,,)

Step-9: RIGHT OUTER JOIN operation returns all rows from the right table, even if there
are no matches in the left table.
grunt> outerright_data_att = JOIN std_data BY id RIGHT, std_att BY id;
grunt> DUMP outerright_data_att;
(1,Jagruthi,21,Hyderabad,9.1,1,Jagruthi,joined,9:10:10)
(4,Sreeja,21,Bengaluru,9.2,4,Sreeja,joined,9:10:24)
(6,Rohit,22,Chennai,7.8,6,Rohit,joined,9:11:15)
(7,Sindhu,23,Mumbai,8.3,7,Sindhu,joined,9:12:25)
(,,,,,8,Sai,joined,9:14:18)
(,,,,,9,Meghana,joined,9:15:25)


Step-10: FULL OUTER JOIN operation returns rows when there is a match in one of the
relations.
grunt> outerfull_data_att = JOIN std_data BY id FULL, std_att BY id;
grunt> DUMP outerfull_data_att;
(1,Jagruthi,21,Hyderabad,9.1,1,Jagruthi,joined,9:10:10)
(2,Praneeth,22,Chennai,8.6,,,,)
(3,Sujith,22,Mumbai,7.8,,,,)
(4,Sreeja,21,Bengaluru,9.2,4,Sreeja,joined,9:10:24)
(5,Mahesh,24,Hyderabad,8.8,,,,)
(6,Rohit,22,Chennai,7.8,6,Rohit,joined,9:11:15)
(7,Sindhu,23,Mumbai,8.3,7,Sindhu,joined,9:12:25)
(,,,,,8,Sai,joined,9:14:18)
(,,,,,9,Meghana,joined,9:15:25)
Step-11: FILTER operator is used to select the required tuples from a relation based
on a condition
grunt> filter_std = FILTER std_data BY city == 'Hyderabad';
grunt> DUMP filter_std;
(1,Jagruthi,21,Hyderabad,9.1)
(5,Mahesh,24,Hyderabad,8.8)
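The Week-9 lab outline also lists a sorting operation, which is not shown elsewhere in this document; a minimal sketch using the ORDER operator to sort students by CGPA in descending order (the relation name sorted_std is an assumption) could be:
grunt> sorted_std = ORDER std_data BY cgpa DESC;
grunt> DUMP sorted_std;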

Step-12: SPLIT operator is used to split a relation into two or more relations
grunt> SPLIT std_data INTO split_std1 IF age<23, split_std2 IF (age>22 AND
age<25);
grunt> DUMP split_std1;
(1,Jagruthi,21,Hyderabad,9.1)
(2,Praneeth,22,Chennai,8.6)
(3,Sujith,22,Mumbai,7.8)
(4,Sreeja,21,Bengaluru,9.2)
(6,Rohit,22,Chennai,7.8)
grunt> DUMP split_std2;
(5,Mahesh,24,Hyderabad,8.8)
(7,Sindhu,23,Mumbai,8.3)


10.B) Perform Eval Functions on the given dataset.


 Apache Pig provides a large set of built-in functions, such as eval, load/store, math, string, date and time, and bag and tuple functions.
 Eval functions are the first category of Pig built-in functions. The following are some of the eval functions offered by Apache Pig.

1) AVG(expression):
To compute the average of the numeric values within a bag. It requires a preceding GROUP ALL statement for global averages and a GROUP BY statement for group averages. AVG ignores NULL values.
Ex: The average GPA for each employee is computed.
grunt> A = LOAD 'Employee.txt' AS (name:chararray, term:chararray, gpa:float);
grunt> B = GROUP A BY name;
grunt> C = FOREACH B GENERATE A.name, AVG(A.gpa);
grunt> DUMP C;

2) CONCAT (expression, expression):


To concatenate two or more expressions. The expressions being concatenated must have identical types. If any sub-expression is null, the resulting expression is also null.
Ex: the fields f1, an underscore string literal, f2 and f3 are concatenated.
grunt> X = LOAD 'data' AS (f1:chararray, f2:chararray, f3:chararray);
grunt> DUMP X;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
grunt> Y = FOREACH X GENERATE CONCAT(f1, '_', f2, f3);
grunt> DUMP Y;
(apache_opensource)
(hadoop_mapreduce)
(pig_piglatin)

3) COUNT(expression):
To count the number of elements in a bag. It requires a preceding GROUP ALL statement for global
counts and a GROUP BY statement for group counts. It ignores the null values.
Ex: grunt> X = LOAD 'data' AS (f1:int, f2:int, f3:int);
grunt> DUMP X;


(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> Y = GROUP X BY f1;
grunt> DUMP Y;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
grunt> A = FOREACH Y GENERATE COUNT(X);
grunt> DUMP A;
(1L)
(2L)
(1L)
(2L)

4) IsEmpty(expression):
To check if a bag or map is empty.
Ex: grunt> Y = filter X by IsEmpty(SSN_NAME);
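The relation X and the bag-valued field SSN_NAME in the example above are not defined in this document. A minimal sketch using the relations already loaded here (keeping only the age groups of the COGROUP result from 10.A Step-5 that actually contain employee tuples; the relation name nonempty_groups is an assumption):
grunt> cogroup_stdemp = COGROUP student BY age, employee BY age;
grunt> nonempty_groups = FILTER cogroup_stdemp BY NOT IsEmpty(employee);
grunt> DUMP nonempty_groups;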

5) MAX(expression):
To find out the maximum of the numeric values or chararrays in a single-column bag. It requires a
preceding GROUP ALL statement for global maximums and a GROUP BY statement for group
maximums. However, it ignores the NULL values.
Ex (reusing relations A and B from the AVG example above): grunt> X = FOREACH B GENERATE group, MAX(A.gpa);

6) MIN(expression):
To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column
bag.
Ex (again reusing A and B from the AVG example): grunt> X = FOREACH B GENERATE group, MIN(A.gpa);
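MAX and MIN can also be combined in a single FOREACH over the same grouped relation; a minimal sketch (the relation name gpa_range is an assumption):
grunt> gpa_range = FOREACH B GENERATE group, MAX(A.gpa), MIN(A.gpa);
grunt> DUMP gpa_range;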


10.C) Develop a WordCount program using Pig Latin statements.


The input file of Pig contains each tuple/record on an individual line, with the words separated by a delimiter (",").
Step-1: Create a directory in HDFS with the name pigdir in the required path using mkdir (if it was not already created in Week-9):
$ hdfs dfs -mkdir -p /bdalab/pigdir
Step-2: In the local file system, create an input file wordcount_data containing the data shown below.
Deer,Bear,River
Car,Car,River
River,Car,River
Deer,River,Bear
Step-3: Move the file from the local file system to HDFS using the put (or copyFromLocal) command and verify using the -cat command.
$ hdfs dfs -put /home/cloudera/pigdir/wordcount_data /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/wordcount_data
Step-4: Open Pig in the Grunt shell and execute the following Pig Latin statement.
$ pig => opens the grunt> shell
Apply Relational Operator – LOAD to load the data from the file wordcount_data into the relation lines; each line of the file becomes one tuple with a single chararray field.
grunt> lines = LOAD '/bdalab/pigdir/wordcount_data' AS (line:chararray);
grunt> DUMP lines;
(Deer,Bear,River)
(Car,Car,River)
(River,Car,River)
(Deer,River,Bear)
Step-5: Convert each line tuple into one tuple per word.
TOKENIZE splits the line into a field for each word.
FLATTEN takes the collection of records returned by TOKENIZE and produces a separate record for each one, calling the single field in the record word.
The FOREACH operator is used to generate specified data transformations based on the column data.
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line, ',')) AS word;
grunt> dump words;
(Deer)


(Bear)
(River)
(Car)
(Car)
(River)
(River)
(Car)
(River)
(Deer)
(River)
(Bear)
Step-6: Group all similar words into each tuple
grunt> groupword = GROUP words by word;
grunt> dump groupword;
(Car,{(Car),(Car),(Car)})
(Bear,{(Bear),(Bear)})
(Deer,{(Deer),(Deer)})
(River,{(River),(River),(River),(River),(River)})

Step-7: Count each grouped word and display the result.


grunt> wordcount = FOREACH groupword GENERATE group, COUNT(words);
grunt> dump wordcount;
(Car,3)
(Bear,2)
(Deer,2)
(River,5)
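To persist the word counts back to HDFS, the STORE operator from Week-9 can be applied to the result; a minimal sketch (the output path is an assumption, and the exact part-file name may differ):
grunt> STORE wordcount INTO '/bdalab/pigdir/wordcount_output' USING PigStorage(',');
$ hdfs dfs -cat /bdalab/pigdir/wordcount_output/part-r-00000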

A. Bhanu Prasad, Associate Professor of CSE, VCE
