
EXPERIMENT – 6

6) Install and Run Pig, then write Pig Latin scripts to sort, group, join,
project, and filter your data.
PROCEDURE:
 Download and extract pig-0.13.0.
Command: wget https://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
Command: tar xvf pig-0.13.0.tar.gz
Command: sudo mv pig-0.13.0 /usr/lib/pig
 Set the path for Pig
Command: sudo gedit $HOME/.bashrc
Add the following lines:
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_COMMON_HOME/conf
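Reload the file so the changes take effect in the current shell:
Command: source $HOME/.bashrc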
 pig.properties file
In the conf folder of Pig, there is a file named pig.properties in which you
can set various parameters. To see the full list of supported properties, run:
pig -h properties
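As a minimal sketch, pig.properties entries look like the following (the values shown here are illustrative, not required):
# default execution mode (local or mapreduce)
exectype=local
# where Pig writes its log file
pig.logfile=/tmp/pig.log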
 Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the
installation is successful, you will get the version of Apache Pig as shown
below.
Command: pig -version
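For this release, the first line of output reads as below; the build revision
and compile date that follow it will vary by build.
Apache Pig version 0.13.0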
Local mode:
Command:
$ pig -x local
15/09/28 10:13:03 INFO pig.Main: Logging error messages to:
/home/Hadoop/pig_1443415383991.log
2015-09-28 10:13:04,838 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine
- Connecting to hadoop file system at: file:///
grunt>

MapReduce mode:
Command:
$ pig -x mapreduce
15/09/28 10:28:46 INFO pig.Main: Logging error messages to:
/home/Hadoop/pig_1443416326123.log
2015-09-28 10:28:46,427 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine
- Connecting to hadoop file system at: file:///
grunt>

Grouping Of Data:
 Put the dataset into HDFS
Command: hadoop fs -put pig/input/student_details.txt pig_data/

 Run a Pig Latin GROUP script on Hadoop MapReduce
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
group_data = GROUP student_details BY age;
Dump group_data;
Output:
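With the student_details.txt shown later in this experiment, the grouped
relation would contain one tuple per distinct age, as sketched below (the
ordering of tuples and of rows within each bag may vary):
(21,{(1,Rajiv,Reddy,21,9848022337,Hyderabad),(4,Preethi,Agarwal,21,9848022330,Pune)})
(22,{(2,siddarth,Battacharya,22,9848022338,Kolkata),(3,Rajesh,Khanna,22,9848022339,Delhi)})
(23,{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),(6,Archana,Mishra,23,9848022335,Chennai)})
(24,{(7,Komal,Nayak,24,9848022334,trivendram),(8,Bharathi,Nambiayar,24,9848022333,Chennai)})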

Joining Of Data:
 Run a Pig Latin JOIN script on Hadoop MapReduce
grunt>
customers = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/customers.txt'
USING PigStorage(',') AS (id:int, name:chararray, age:int,
address:chararray, salary:int);
orders = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/orders.txt'
USING PigStorage(',') AS (oid:int, date:chararray,
customer_id:int, amount:int);

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
 Verification
Verify the relation customer_orders using the DUMP operator as shown below.
grunt> Dump customer_orders;
 Output
You will get the following output, listing the contents of the relation named
customer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Sorting of Data:
 Run a Pig Latin ORDER BY script on Hadoop MapReduce
Assume that we have a file named student_details.txt in the HDFS directory
/pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name
student_details as shown below.
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
Let us now sort the relation in descending order of age and store the result
in another relation named order_by_data using the ORDER BY operator, as
shown below.
grunt> order_by_data = ORDER student_details BY age DESC;
 Verification
Verify the relation order_by_data using the DUMP operator as shown
below.
grunt> Dump order_by_data;
 Output
It will produce the following output, displaying the contents of the
relation order_by_data as follows.
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
Filtering of data:
 Run a Pig Latin FILTER script on Hadoop MapReduce
Assume that we have a file named student_details.txt in the HDFS directory
/pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name
student_details as shown below.
grunt>
student_details = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt'
USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students
who belong to the city Chennai.
grunt> filter_data = FILTER student_details BY city == 'Chennai';
 Verification
Verify the relation filter_data using the DUMP operator as shown
below.
grunt> Dump filter_data;
 Output
It will produce the following output, displaying the contents of the
relation filter_data as follows.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)

PIG Running Modes


We can manually override the default mode using the -x or -exectype option:
$ pig -x local
$ pig -x mapreduce
Building blocks
- Field – a piece of data. Ex: "abc"
- Tuple – an ordered set of fields, represented with "(" and ")"
Ex: (10.3, abc, 5)
- Bag – a collection of tuples, represented with "{" and "}"
Ex: { (10.3, abc, 5), (def, 12, 13.5) }
Load Command

LOAD 'data' [USING function] [AS schema];


• data – name of the directory or file – must be in single quotes
• USING – specifies the load function to use
– By default uses PigStorage, which parses each line into fields
using a delimiter.
- The default delimiter is tab ('\t')
• AS – assigns a schema to the incoming data
– Assigns names to fields
– Declares types of fields
LOADING DATA:
• Create file in local file system
[cloudera@localhost ~]$ cat > a.txt
25
35
45
55
65
24
12
19
27
35
24
• Copy file from local file system to hdfs
[cloudera@localhost ~]$ hadoop fs -put a.txt
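The filter, sort, and group examples below operate on a relation named data,
which can be loaded from the file above (a minimal load, assuming one integer
age per line):
grunt> data = LOAD 'a.txt' AS (age:int);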
Pig Latin – Diagnostic Tools
• Display the structure of the Bag
grunt> DESCRIBE <bag_name>;
ex: DESCRIBE data;
• Display Execution Plan
– Produces various reports
• Logical Plan
• MapReduce Plan
grunt> EXPLAIN <bag_name>;
ex: EXPLAIN data;
• Illustrate how the Pig engine transforms the data
grunt> ILLUSTRATE <bag_name>;
ex: ILLUSTRATE data;
Filter data
grunt> filter1 = filter data by age > 30;
grunt> dump filter1;
(35)
(45)
(55)
(65)
(35)
grunt> filter2 = filter data by age < 20;
grunt> dump filter2;
(12)
(19)
Sort data
Sort by Ascending order
grunt> sort1 = order data by age ASC;
grunt> dump sort1;
(12)
(19)
(24)
(24)
(25)
(27)
(35)
(35)
(45)
(55)
(65)
Sort by Descending order
grunt> sort2 = order data by age DESC;
grunt> dump sort2;
(65)
(55)
(45)
(35)
(35)
(27)
(25)
(24)
(24)
(19)
(12)
Grouping data
grunt> group1 = group data by age;
grunt> describe group1;
group1: {group: int,data: {(age: int)}}
grunt> dump group1;
(12,{(12)})
(19,{(19)})
(24,{(24),(24)})
(25,{(25)})
(27,{(27)})
(35,{(35),(35)})
(45,{(45)})
(55,{(55)})
(65,{(65)})
The data bag is grouped by 'age', so the group element contains unique age
values.
To see how Pig transforms the data:
grunt> ILLUSTRATE group1;
FOREACH
FOREACH <bag> GENERATE <data>
Iterates over each element in the bag and produces a result.
grunt> records = LOAD 'std.txt' USING PigStorage(',') AS (roll:int,
name:chararray);
grunt> dump records;
(501,aaa)
(502,hhh)
(507,yyy)
(204,rrr)
(510,bbb)
grunt> stdname = foreach records generate name;
grunt> dump stdname;
(aaa)
(hhh)
(yyy)
(rrr)
(bbb)
grunt> stdroll = foreach records generate roll;
grunt> dump stdroll;
(501)
(502)
(507)
(204)
(510)

JOIN
The JOIN operator is used to combine records from two or more relations.
While performing a join, we declare one field (or a group of fields) from each
relation as keys. When these keys match, the corresponding tuples are joined;
otherwise the records are dropped. Joins can be of the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
Self-join
Self-join is used to join a table with itself.
Inner Join
The default join is the inner join – rows are joined where the keys match, and
rows that do not have matches are not included in the result.
Outer Join
Unlike the inner join, an outer join keeps records that do not join with the
'other' record-set:
Left Outer – Records from the first data-set are included whether they
have a match or not. Fields from the unmatched (second) bag are set to null.
Right Outer – The opposite of Left Outer Join: records from the
second data-set are included no matter what. Fields from the
unmatched (first) bag are set to null.
Full Outer – Records from both sides are included. For
unmatched records the fields from the 'other' bag are set to null.
[cloudera@localhost ~]$ cat > a.txt
1,2,3
4,2,1
8,3,4
4,3,3
7,2,5
8,4,3
[cloudera@localhost ~]$ cat > b.txt
2,4
8,9
1,3
2,7
2,9
4,6
4,9
[cloudera@localhost ~]$ hadoop fs -put a.txt
[cloudera@localhost ~]$ hadoop fs -put b.txt
Self join
Self-join joins a table with itself as if the table were two relations,
temporarily renaming at least one relation; i.e., we load the same table under
two names and join it with itself.
grunt> ONE = load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);
grunt> TWO = load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);
grunt> SELFJ = JOIN ONE BY a1, TWO BY a1;
grunt> describe SELFJ;
SELFJ: {ONE::a1: int,ONE::a2: int,ONE::a3: int,TWO::a1: int,TWO::a2:
int,TWO::a3: int}
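Dumping SELFJ on the a.txt shown above would pair every row with each row
sharing the same a1 key, as sketched below (row order may vary):
grunt> Dump SELFJ;
(1,2,3,1,2,3)
(4,2,1,4,2,1)
(4,2,1,4,3,3)
(4,3,3,4,2,1)
(4,3,3,4,3,3)
(7,2,5,7,2,5)
(8,3,4,8,3,4)
(8,3,4,8,4,3)
(8,4,3,8,3,4)
(8,4,3,8,4,3)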
Equi-join
Inner join is used quite frequently; it is also referred to as an equijoin.
An inner join returns rows when there is a match in both tables.
grunt> A = load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);
grunt> B = load 'b.txt' using PigStorage(',') as (b1:int,b2:int,b3:int);
grunt> X = Join A by a1, B by b1;
grunt> Dump X;
(1,2,3,1,3,)
(4,2,1,4,6,)
(4,2,1,4,9,)
(4,3,3,4,6,)
(4,3,3,4,9,)
(8,3,4,8,9,)
(8,4,3,8,9,)
Note: b.txt has only two columns, so the declared field b3 loads as null –
hence the trailing empty value in each tuple above.
Left outer join
grunt> A = LOAD 'a.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int);
grunt> B = LOAD 'b.txt' USING PigStorage(',') AS (b1:int,b2:int);
grunt> LEFTJ = JOIN A BY a1 LEFT OUTER, B BY b1;
grunt> DUMP LEFTJ;
(1,2,3,1,3)
(4,3,3,4,9)
(4,3,3,4,6)
(4,2,1,4,9)
(4,2,1,4,6)
(7,2,5,,)
(8,4,3,8,9)
(8,3,4,8,9)
Right outer join
grunt> A = LOAD 'a.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int);
grunt> B = LOAD 'b.txt' USING PigStorage(',') AS (b1:int,b2:int);
grunt> RIGHTJ = JOIN A BY a1 RIGHT OUTER, B BY b1;
grunt> DUMP RIGHTJ;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Full join
grunt> A = LOAD 'a.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int);
grunt> B = LOAD 'b.txt' USING PigStorage(',') AS (b1:int,b2:int);
grunt> FULLJ = JOIN A BY a1 FULL OUTER, B BY b1;
grunt> DUMP FULLJ;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)
UNION & SPLIT
UNION combines multiple relations together, whereas SPLIT partitions a
relation into multiple ones.
grunt> cat a.txt
1,2,3
4,2,1
8,3,4
grunt> cat b.txt
4,3,3
7,2,5
8,4,3
grunt> a = load 'a.txt' using PigStorage(',') as (a1:int, a2:int, a3:int);
grunt> b = load 'b.txt' using PigStorage(',') as (b1:int, b2:int, b3:int);
grunt> dump a;
(1,2,3)
(4,2,1)
(8,3,4)
grunt> dump b;
(4,3,3)
(7,2,5)
(8,4,3)
grunt> c = UNION a, b;
grunt> dump c;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> SPLIT c into sp1 if $0 == 4, sp2 if $0 == 8;
The SPLIT on 'c' sends a tuple to sp1 if its first field ($0) equals 4, and to
sp2 if it equals 8.
grunt> dump sp1;
(4,3,3)
(4,2,1)
grunt > dump sp2;
(8,4,3)
(8,3,4)
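The next example loads a file named char.txt that is not created above. A
minimal sketch that matches the grouped output below (one character per line;
the order of the lines does not matter) could be:
[cloudera@localhost ~]$ cat > char.txt
a
a
a
c
c
i
i
i
k
k
k
k
l
l
[cloudera@localhost ~]$ hadoop fs -put char.txt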
grunt> chars = LOAD 'char.txt' AS (c:chararray);
grunt> chargrp = GROUP chars by c;
grunt> dump chargrp;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})
grunt> describe chargrp;
chargrp: {group: chararray,chars: {(c: chararray)}}
FOREACH with Functions
– Pig comes with many built-in functions, including COUNT, FLATTEN,
CONCAT, etc.
– You can also implement custom functions (UDFs)
COUNT:

grunt> counts = FOREACH chargrp GENERATE group, COUNT(chars);


(a,3)
(c,2)
(i,3)
(k,4)
(l,2)
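CONCAT can be applied in the same way. A sketch using the records relation
loaded earlier (the relation name fullinfo is hypothetical; CONCAT requires
chararray arguments, hence the cast):
grunt> fullinfo = FOREACH records GENERATE CONCAT((chararray)roll, name);
grunt> dump fullinfo;
(501aaa)
(502hhh)
(507yyy)
(204rrr)
(510bbb)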