ABP W9-W10 Big Data Analytics Lab-PIG
(A7902) (VCE-R21)
For the given Student and Employee datasets, perform relational operations such as
loading, storing and the diagnostic operations (Dump, Describe, Illustrate and Explain)
in the Hadoop Pig framework using Cloudera.
Student ID First Name Age City CGPA
001 Jagruthi 21 Hyderabad 9.1
002 Praneeth 22 Chennai 8.6
003 Sujith 22 Mumbai 7.8
004 Sreeja 21 Bengaluru 9.2
005 Mahesh 24 Hyderabad 8.8
006 Rohit 22 Chennai 7.8
007 Sindhu 23 Mumbai 8.3
Step-1: Create a directory named pigdir in HDFS at the required path using mkdir (the -p
option also creates the parent directory /bdalab if it does not exist yet):
$ hdfs dfs -mkdir -p /bdalab/pigdir
Step-2: The input file for Pig holds one tuple/record per line, with the fields
separated by a delimiter (",").
In the local file system, create an input file student_data.txt containing the data
shown below.
001,Jagruthi,21,Hyderabad,9.1
002,Praneeth,22,Chennai,8.6
003,Sujith,22,Mumbai,7.8
004,Sreeja,21,Bengaluru,9.2
005,Mahesh,24,Hyderabad,8.8
006,Rohit,22,Chennai,7.8
007,Sindhu,23,Mumbai,8.3
In the local file system, create an input file employee_data.txt containing the data
shown below.
001,Angelina,22,LosAngeles
002,Jackie,23,Beijing
003,Deepika,22,Mumbai
004,Pawan,24,Hyderabad
005,Rajani,21,Chennai
006,Amitabh,22,Mumbai
Step-3: Move the files from the local file system to HDFS using the put (or copyFromLocal)
command and verify using the -cat command. The file names must match those created in
Step-2 and loaded in Step-4.
$ hdfs dfs -put /home/cloudera/pigdir/student_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/student_data.txt
$ hdfs dfs -put /home/cloudera/pigdir/employee_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/employee_data.txt
Step-4: Apply the relational operator LOAD to load the data from the files
student_data.txt and employee_data.txt into Pig by executing the following Pig Latin
statements in the Grunt shell. Pig Latin keywords such as LOAD and STORE are NOT case
sensitive, but relation names, field names and function names are.
$ pig => opens the grunt> shell
grunt> student = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, city:chararray, cgpa:double);
grunt> employee = LOAD '/bdalab/pigdir/employee_data.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, city:chararray);
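As a rough illustration of what LOAD with PigStorage(',') and the student schema does to each input line, the plain-Python sketch below splits a record on the delimiter and casts every field to its declared type. It is not Pig code. The names SCHEMA and parse_record are made up for this sketch. A field that is missing or cannot be cast becomes None, which stands in for the null that Pig produces in that situation.

```python
# Plain-Python sketch (not Pig): mimic LOAD ... USING PigStorage(',') AS (schema).
# SCHEMA and parse_record are illustrative names, not part of Pig.
SCHEMA = [("id", int), ("name", str), ("age", int), ("city", str), ("cgpa", float)]


def parse_record(line, delimiter=","):
    """Split one input line and cast each field; bad or missing fields become None."""
    raw = line.rstrip("\n").split(delimiter)
    fields = []
    for position, (_, cast) in enumerate(SCHEMA):
        try:
            fields.append(cast(raw[position]))
        except (IndexError, ValueError):
            fields.append(None)  # stands in for Pig's null
    return tuple(fields)


print(parse_record("001,Jagruthi,21,Hyderabad,9.1"))
# (1, 'Jagruthi', 21, 'Hyderabad', 9.1)
```

Because id is declared as int, the value 001 in the file appears as 1 in the DUMP output of the later steps.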
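Step-5: Apply the relational operator STORE to store each relation in an HDFS directory
under "/bdalab/pigdir/pig_output/" as shown below. Each STORE needs its own output
directory, and that directory must not already exist, otherwise the statement fails.
grunt> STORE student INTO '/bdalab/pigdir/pig_output/student' USING PigStorage(',');
grunt> STORE employee INTO '/bdalab/pigdir/pig_output/employee' USING PigStorage(',');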
• SELF-Join
• INNER-Join
OUTER JOIN returns all the rows from at least one of the relations. An outer join operation
is carried out in three ways:
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
Step-8: The LEFT OUTER JOIN operation returns all rows from the left relation, even if
there are no matches in the right relation. Fields of the right relation are null for
unmatched rows.
Note: std_data is the LEFT relation
grunt> outerleft_data_att = JOIN std_data BY id LEFT, std_att BY id;
grunt> DUMP outerleft_data_att;
(1,Jagruthi,21,Hyderabad,9.1,1,Jagruthi,joined,9:10:10)
(2,Praneeth,22,Chennai,8.6,,,,)
(3,Sujith,22,Mumbai,7.8,,,,)
(4,Sreeja,21,Bengaluru,9.2,4,Sreeja,joined,9:10:24)
(5,Mahesh,24,Hyderabad,8.8,,,,)
(6,Rohit,22,Chennai,7.8,6,Rohit,joined,9:11:15)
(7,Sindhu,23,Mumbai,8.3,7,Sindhu,joined,9:12:25)
Note: std_att is the LEFT relation
grunt> outerleft_att_data = JOIN std_att BY id LEFT, std_data BY id;
grunt> DUMP outerleft_att_data;
(1,Jagruthi,joined,9:10:10,1,Jagruthi,21,Hyderabad,9.1)
(4,Sreeja,joined,9:10:24,4,Sreeja,21,Bengaluru,9.2)
(6,Rohit,joined,9:11:15,6,Rohit,22,Chennai,7.8)
(7,Sindhu,joined,9:12:25,7,Sindhu,23,Mumbai,8.3)
(8,Sai,joined,9.14:18,,,,,)
(9,Meghana,joined,9.15:25,,,,,)
Step-9: The RIGHT OUTER JOIN operation returns all rows from the right relation, even if
there are no matches in the left relation.
grunt> outerright_data_att = JOIN std_data BY id RIGHT, std_att BY id;
grunt> DUMP outerright_data_att;
(1,Jagruthi,21,Hyderabad,9.1,1,Jagruthi,joined,9:10:10)
(4,Sreeja,21,Bengaluru,9.2,4,Sreeja,joined,9:10:24)
(6,Rohit,22,Chennai,7.8,6,Rohit,joined,9:11:15)
(7,Sindhu,23,Mumbai,8.3,7,Sindhu,joined,9:12:25)
(,,,,,8,Sai,joined,9.14:18)
(,,,,,9,Meghana,joined,9.15:25)
Step-10: The FULL OUTER JOIN operation returns all rows from both relations. Matching rows
are combined, and rows without a match in the other relation are padded with nulls.
grunt> outerfull_data_att = JOIN std_data BY id FULL, std_att BY id;
grunt> DUMP outerfull_data_att;
(1,Jagruthi,21,Hyderabad,9.1,1,Jagruthi,joined,9:10:10)
(2,Praneeth,22,Chennai,8.6,,,,)
(3,Sujith,22,Mumbai,7.8,,,,)
(4,Sreeja,21,Bengaluru,9.2,4,Sreeja,joined,9:10:24)
(5,Mahesh,24,Hyderabad,8.8,,,,)
(6,Rohit,22,Chennai,7.8,6,Rohit,joined,9:11:15)
(7,Sindhu,23,Mumbai,8.3,7,Sindhu,joined,9:12:25)
(,,,,,8,Sai,joined,9.14:18)
(,,,,,9,Meghana,joined,9.15:25)
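The three outer joins in Steps 8 to 10 can be checked by hand with a small plain-Python sketch. It is not Pig code, and the helper outer_join with its how parameter is a name invented here. The rows are copied from the DUMP output above, and None plays the role of Pig's null. Pig does not guarantee the order of DUMP output, so only the set of rows is meant to match.

```python
# Plain-Python sketch (not Pig) of LEFT, RIGHT and FULL outer joins on id.
# Rows are copied from the DUMP output of Steps 8-10; outer_join is a made-up helper.
std_data = [
    (1, "Jagruthi", 21, "Hyderabad", 9.1),
    (2, "Praneeth", 22, "Chennai", 8.6),
    (3, "Sujith", 22, "Mumbai", 7.8),
    (4, "Sreeja", 21, "Bengaluru", 9.2),
    (5, "Mahesh", 24, "Hyderabad", 8.8),
    (6, "Rohit", 22, "Chennai", 7.8),
    (7, "Sindhu", 23, "Mumbai", 8.3),
]
std_att = [
    (1, "Jagruthi", "joined", "9:10:10"),
    (4, "Sreeja", "joined", "9:10:24"),
    (6, "Rohit", "joined", "9:11:15"),
    (7, "Sindhu", "joined", "9:12:25"),
    (8, "Sai", "joined", "9.14:18"),
    (9, "Meghana", "joined", "9.15:25"),
]


def outer_join(left, right, how):
    """Join two lists of tuples on their first field; how is 'left', 'right' or 'full'."""
    left_width, right_width = len(left[0]), len(right[0])
    joined = []
    matched_right = set()
    for left_row in left:
        hits = [r for r in right if r[0] == left_row[0]]
        for right_row in hits:
            joined.append(left_row + right_row)
            matched_right.add(right_row)
        if not hits and how in ("left", "full"):
            # Keep the unmatched left row, padding the right side with nulls.
            joined.append(left_row + (None,) * right_width)
    if how in ("right", "full"):
        for right_row in right:
            if right_row not in matched_right:
                # Keep the unmatched right row, padding the left side with nulls.
                joined.append((None,) * left_width + right_row)
    return joined


for how in ("left", "right", "full"):
    print(how, len(outer_join(std_data, std_att, how)))
# left 7
# right 6
# full 9
```

The row counts agree with the listings above: 7 rows for the left join with std_data on the left, 6 for the right join, and 9 for the full join.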
Step-11: The FILTER operator selects the required tuples from a relation based
on a condition.
grunt> filter_std = FILTER std_data BY city == 'Hyderabad';
grunt> DUMP filter_std;
(1,Jagruthi,21,Hyderabad,9.1)
(5,Mahesh,24,Hyderabad,8.8)
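The FILTER step keeps exactly the tuples for which the condition is true. The plain-Python sketch below (not Pig code) applies the same test to the student rows and also shows that the string comparison is case sensitive, as it is for chararray values in Pig. The variable wrong_case is an illustrative name.

```python
# Plain-Python sketch (not Pig): FILTER std_data BY city == 'Hyderabad'.
std_data = [
    (1, "Jagruthi", 21, "Hyderabad", 9.1),
    (2, "Praneeth", 22, "Chennai", 8.6),
    (3, "Sujith", 22, "Mumbai", 7.8),
    (4, "Sreeja", 21, "Bengaluru", 9.2),
    (5, "Mahesh", 24, "Hyderabad", 8.8),
    (6, "Rohit", 22, "Chennai", 7.8),
    (7, "Sindhu", 23, "Mumbai", 8.3),
]

# city is the fourth field (index 3) of every tuple.
filter_std = [row for row in std_data if row[3] == "Hyderabad"]
wrong_case = [row for row in std_data if row[3] == "hyderabad"]

print(filter_std)
# [(1, 'Jagruthi', 21, 'Hyderabad', 9.1), (5, 'Mahesh', 24, 'Hyderabad', 8.8)]
print(len(wrong_case))
# 0
```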
Step-12: SPLIT operator is used to split a relation into two or more relations.
grunt> SPLIT std_data INTO split_std1 IF age<23, split_std2 IF (age>22 AND age<25);
grunt> DUMP split_std1;
(1,Jagruthi,21,Hyderabad,9.1)
(2,Praneeth,22,Chennai,8.6)
(3,Sujith,22,Mumbai,7.8)
(4,Sreeja,21,Bengaluru,9.2)
(6,Rohit,22,Chennai,7.8)
grunt> DUMP split_std2;
(5,Mahesh,24,Hyderabad,8.8)
(7,Sindhu,23,Mumbai,8.3)
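SPLIT evaluates every condition independently for each tuple, so a tuple can land in several output relations or in none. The plain-Python sketch below (not Pig code) reproduces Step-12 under that reading. The helper name split is invented for the sketch.

```python
# Plain-Python sketch (not Pig): SPLIT std_data INTO ... IF cond1, ... IF cond2.
std_data = [
    (1, "Jagruthi", 21, "Hyderabad", 9.1),
    (2, "Praneeth", 22, "Chennai", 8.6),
    (3, "Sujith", 22, "Mumbai", 7.8),
    (4, "Sreeja", 21, "Bengaluru", 9.2),
    (5, "Mahesh", 24, "Hyderabad", 8.8),
    (6, "Rohit", 22, "Chennai", 7.8),
    (7, "Sindhu", 23, "Mumbai", 8.3),
]


def split(rows, *conditions):
    """Return one list per condition; each row is tested against every condition."""
    return [[row for row in rows if condition(row)] for condition in conditions]


# age is the third field (index 2) of every tuple.
split_std1, split_std2 = split(
    std_data,
    lambda row: row[2] < 23,
    lambda row: 22 < row[2] < 25,
)

print([row[0] for row in split_std1])
# [1, 2, 3, 4, 6]
print([row[0] for row in split_std2])
# [5, 7]
```

With these two conditions every student falls into exactly one relation, because the age ranges do not overlap and together cover all ages in the data.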
1) AVG(expression):
Computes the average of the numeric values in a single-column bag. It requires a preceding GROUP ALL
statement for a global average and a GROUP BY statement for per-group averages. AVG ignores
NULL values.
Ex: The average GPA for each name in Employee.txt is computed
grunt> A = LOAD 'Employee.txt' AS (name:chararray, term:chararray, gpa:float);
grunt> B = GROUP A BY name;
grunt> C = FOREACH B GENERATE group, AVG(A.gpa);
grunt> DUMP C;
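The GROUP followed by AVG pattern can be pictured with the plain-Python sketch below. It is not Pig code. The contents of Employee.txt are not shown in this lab, so the sample rows and the names in them are hypothetical, and avg_by_name is a helper invented for the sketch. None stands in for a Pig null and is skipped before averaging.

```python
# Plain-Python sketch (not Pig): GROUP A BY name, then AVG(A.gpa) per group.
from collections import defaultdict

# Hypothetical sample rows (name, term, gpa); the lab's Employee.txt is not shown.
records = [
    ("Asha", "term1", 3.0),
    ("Asha", "term2", 4.0),
    ("Ravi", "term1", None),  # None stands in for a Pig null
    ("Ravi", "term2", 3.5),
]


def avg_by_name(rows):
    """Collect the gpa values per name, drop nulls, and average what is left."""
    bags = defaultdict(list)
    for name, _term, gpa in rows:
        bags[name].append(gpa)
    averages = {}
    for name, gpas in bags.items():
        present = [gpa for gpa in gpas if gpa is not None]
        # Sketch choice: report None when a group has no non-null value.
        averages[name] = sum(present) / len(present) if present else None
    return averages


print(avg_by_name(records))
# {'Asha': 3.5, 'Ravi': 3.5}
```

The null in the second group does not pull the average down: it is left out of both the sum and the count.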
3) COUNT(expression):
Counts the number of elements in a bag. It requires a preceding GROUP ALL statement for a global
count and a GROUP BY statement for per-group counts. COUNT skips tuples whose first field is null;
use COUNT_STAR to include them.
Ex: grunt> X = LOAD 'data' AS (f1:int,f2:int,f3:int);
grunt> DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> Y = GROUP X BY f1;
grunt> DUMP Y;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
grunt> A = FOREACH Y GENERATE COUNT(X);
grunt> DUMP A;
(1L)
(2L)
(1L)
(2L)
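The counts above can be reproduced with a plain-Python sketch (not Pig code) that groups the six tuples of X by f1 and counts each bag. Tuples whose first field is None are skipped, mirroring how COUNT treats a null first field. The variable names follow the example, and counts is a name added for the sketch.

```python
# Plain-Python sketch (not Pig): Y = GROUP X BY f1; then COUNT(X) per group.
from collections import defaultdict

X = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# Build one bag of tuples per distinct value of f1 (the first field).
Y = defaultdict(list)
for row in X:
    Y[row[0]].append(row)

# Count each bag, skipping tuples whose first field is null (None).
counts = {
    key: sum(1 for row in bag if row[0] is not None)
    for key, bag in sorted(Y.items())
}

print(counts)
# {1: 1, 4: 2, 7: 1, 8: 2}
```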
4) IsEmpty(expression):
To check if a bag or map is empty.
Ex: grunt> Y = filter X by IsEmpty(SSN_NAME);
5) MAX(expression):
To find out the maximum of the numeric values or chararrays in a single-column bag. It requires a
preceding GROUP ALL statement for global maximums and a GROUP BY statement for group
maximums. However, it ignores the NULL values.
Ex: grunt> X = FOREACH B GENERATE group, MAX(A.gpa);
6) MIN(expression):
To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column
bag.
Ex: grunt> X = FOREACH B GENERATE group, MIN(A.gpa);
(Bear)
(River)
(Car)
(Car)
(River)
(River)
(Car)
(River)
(Deer)
(River)
(Bear)
Step-6: Group identical words so that each distinct word gets one tuple holding all its occurrences
grunt> groupword = GROUP words by word;
grunt> dump groupword;
(Car,{(Car),(Car),(Car)})
(Bear,{(Bear),(Bear)})
(Deer,{(Deer),(Deer)})
(River,{(River),(River),(River),(River),(River)})
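The grouping in this step can be sketched in plain Python (not Pig code). The word listing above is cut off at the top, so the sketch takes the number of occurrences of each word from the grouped output instead. Each bag is a list of one-field tuples, as in the DUMP output.

```python
# Plain-Python sketch (not Pig): groupword = GROUP words BY word.
# Occurrence counts are taken from the grouped DUMP output above.
words = ["Car"] * 3 + ["Bear"] * 2 + ["Deer"] * 2 + ["River"] * 5

# One bag per distinct word; every occurrence is kept as a one-field tuple.
groupword = {}
for word in words:
    groupword.setdefault(word, []).append((word,))

for word, bag in groupword.items():
    print(word, len(bag))
# Car 3
# Bear 2
# Deer 2
# River 5
```

The size of each bag is the number of times the word occurs in the input.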