Pig
6) Install and run Pig, then write Pig Latin scripts to sort, group, join,
project, and filter your data.
PROCEDURE:
Download and extract pig-0.13.0.
Command: wget https://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
Command: tar xvf pig-0.13.0.tar.gz
Command: sudo mv pig-0.13.0 /usr/lib/pig
Set Path for pig
Command: sudo gedit $HOME/.bashrc
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_COMMON_HOME/conf
pig.properties file
The conf folder of Pig contains a file named pig.properties, in which you can
set various parameters. To list the supported properties, run:
Command: pig -h properties
Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the
installation is successful, you will get the version of Apache Pig as shown
below.
Command: pig -version
grunt>
Grouping Of Data:
Put the dataset into HDFS.
Command: hadoop fs -put pig/input/data.txt pig_data/
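The grouping commands further down operate on a relation named data, but the LOAD statement that creates it is not shown. A possible sketch follows. It assumes the file holds one integer age per line, which matches the schema that DESCRIBE reports for group1, and that the relative HDFS path is the one used in the put command above.

```pig
-- Assumption: data.txt contains a single integer column (age), one value per line.
-- The path is relative to the HDFS home directory of the current user.
data = LOAD 'pig_data/data.txt' AS (age:int);
DESCRIBE data;
```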
Joining Of Data:
Run the Pig JOIN script on Hadoop MapReduce.
grunt> customers = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/customers.txt'
       USING PigStorage(',') AS (id:int, name:chararray, age:int,
       address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/orders.txt'
       USING PigStorage(',') AS (oid:int, date:chararray,
       customer_id:int, amount:int);
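The two LOAD statements only read the files; the join itself is not shown. A minimal sketch of it is given here. It assumes that customer_id in orders refers to id in customers, and the relation name customer_orders is made up for illustration.

```pig
-- Inner join of the two relations loaded above.
-- Assumption: orders.customer_id is the foreign key for customers.id.
customer_orders = JOIN customers BY id, orders BY customer_id;
DESCRIBE customer_orders;
DUMP customer_orders;
```

Fields in the result are addressed with the relation prefix, for example customers::name or orders::amount.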
Grouping data
grunt> group1 = group data by age;
grunt> describe group1;
group1: {group: int,data: {(age: int)}}
grunt> dump group1;
(12,{(12)})
(19,{(19)})
(24,{(24),(24)})
(25,{(25)})
(27,{(27)})
(35,{(35),(35)})
(45,{(45)})
(55,{(55)})
(65,{(65)})
The data bag is grouped by ‘age’, so the group element contains unique
values.
To see how Pig transforms the data:
grunt> ILLUSTRATE group1;
Load Command
JOIN
The JOIN operator is used to combine records from two or more relations.
While performing a join operation, we declare one field (or a group of fields)
from each relation as the key. When these keys match, the two tuples are
joined; otherwise (in an inner join) the records are dropped. Joins can be of
the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
Self-join
Self-join is used to join a table with itself.
Inner Join
The default join is an inner join: rows are joined where the keys match, and
rows that have no match are not included in the result.
Outer Join
Records that do not join with the ‘other’ record set are still included in the
result.
Left Outer – Records from the first data set are included whether they
have a match or not. Fields from the unmatched (second) bag are set to null.
Right Outer – The opposite of Left Outer Join: records from the
second data set are included no matter what. Fields from the
unmatched (first) bag are set to null.
Full Outer – Records from both sides are included. For
unmatched records, the fields from the ‘other’ bag are set to null.
[cloudera@localhost ~]$ cat>a.txt
1,2,3
4,2,1
8,3,4
4,3,3
7,2,5
8,4,3
[cloudera@localhost ~]$ cat>b.txt
2,4
8,9
1,3
2,7
2,9
4,6
4,9
[cloudera@localhost ~]$ hadoop fs -put a.txt
[cloudera@localhost ~]$ hadoop fs -put b.txt
Self join
Self-join is used to join a table with itself as if the table were two relations,
temporarily renaming at least one relation.
i.e., we join one table to itself rather than joining two tables.
grunt> ONE = load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);
grunt> TWO = load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);
grunt> SELFJ = JOIN ONE BY a1, TWO BY a1;
grunt> describe SELFJ;
SELFJ: {ONE::a1: int,ONE::a2: int,ONE::a3: int,TWO::a1: int,TWO::a2:
int,TWO::a3: int}
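Because both sides of the self-join carry the same field names, a projection on SELFJ has to use the relation prefix shown by DESCRIBE. The following is a small sketch of such a projection; the relation name SELFP and the aliases left_a2 and right_a3 are illustrative, not part of the original exercise.

```pig
-- Keep the join key once, plus one field from each side of the self-join.
-- ONE:: and TWO:: disambiguate fields that share a name.
SELFP = FOREACH SELFJ GENERATE ONE::a1 AS a1, ONE::a2 AS left_a2, TWO::a3 AS right_a3;
DESCRIBE SELFP;
DUMP SELFP;
```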
Equi-join
Inner join is used quite frequently; it is also referred to as equijoin.
An inner join returns rows only when there is a match in both tables.
grunt> A = load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);
grunt> B = load 'b.txt' using PigStorage(',') as (b1:int,b2:int);
grunt> X = Join A by a1, B by b1;
grunt> Dump X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Left outer join
A = LOAD 'a.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int);
B = LOAD 'b.txt' USING PigStorage(',') AS (b1:int,b2:int);
LEFTJ = JOIN A BY a1 LEFT OUTER, B BY b1;
DUMP LEFTJ;
(1,2,3,1,3)
(4,3,3,4,9)
(4,3,3,4,6)
(4,2,1,4,9)
(4,2,1,4,6)
(7,2,5,,)
(8,4,3,8,9)
(8,3,4,8,9)
Right outer join
A = LOAD 'a.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int);
B = LOAD 'b.txt' USING PigStorage(',') AS (b1:int,b2:int);
RIGHTJ = JOIN A BY a1 RIGHT OUTER, B BY b1;
DUMP RIGHTJ;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Full join
A = LOAD 'a.txt' USING PigStorage(',') AS (a1:int,a2:int,a3:int);
B = LOAD 'b.txt' USING PigStorage(',') AS (b1:int,b2:int);
FULLJ = JOIN A BY a1 FULL, B BY b1;
DUMP FULLJ;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)
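The null-padded rows in an outer join result can be isolated with FILTER, which is a convenient way to see which records found no partner. A minimal sketch on the FULLJ relation follows; the relation name UNMATCHED is illustrative.

```pig
-- A row without a partner has nulls on one whole side, so testing the
-- join key of each side is enough to find it.
UNMATCHED = FILTER FULLJ BY A::a1 IS NULL OR B::b1 IS NULL;
DUMP UNMATCHED;
```

For the data above this keeps the three rows with b1 = 2 and the row (7,2,5,,), in no particular order.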
UNION & SPLIT
UNION combines multiple relations together, whereas SPLIT partitions a
relation into multiple ones.
grunt> cat a.txt
1,2,3
4,2,1
8,3,4
grunt> cat b.txt
4,3,3
7,2,5
8,4,3
grunt> a = load 'a.txt' using PigStorage(',') as (a1:int, a2:int, a3:int);
grunt> b = load 'b.txt' using PigStorage(',') as (b1:int, b2:int, b3:int);
grunt> dump a;
(1,2,3)
(4,2,1)
(8,3,4)
grunt> dump b;
(4,3,3)
(7,2,5)
(8,4,3)
grunt> c = UNION a, b;
grunt> dump c;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
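UNION does not guarantee any order of the combined tuples. If a sorted result is needed, ORDER can be applied afterwards. The sketch below refers to fields by position and uses the illustrative relation name sorted_c.

```pig
-- Sort by the first field, largest first; ties are broken by the second field.
sorted_c = ORDER c BY $0 DESC, $1 ASC;
DUMP sorted_c;
```

For the six tuples shown above, this prints (8,3,4), (8,4,3), (7,2,5), (4,2,1), (4,3,3), (1,2,3).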
grunt> SPLIT c into sp1 if $0 == 4, sp2 if $0 == 8;
The SPLIT operation on ‘c’ sends a tuple to sp1 if its first field ($0) is 4, and to sp2 if
it is 8. Tuples that match neither condition are dropped.
grunt> dump sp1;
(4,3,3)
(4,2,1)
grunt> dump sp2;
(8,4,3)
(8,3,4)
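Tuples whose first field is neither 4 nor 8 do not appear in sp1 or sp2. A sketch of two related forms is given here: a catch-all branch with OTHERWISE (available from Pig 0.10 onwards) and a single-condition FILTER. The relation names sp_other and only4 are illustrative.

```pig
-- Same split as above, plus a catch-all relation for every other tuple.
SPLIT c INTO sp1 IF $0 == 4, sp2 IF $0 == 8, sp_other OTHERWISE;
DUMP sp_other;

-- A single branch of the split can also be written as a plain FILTER.
only4 = FILTER c BY $0 == 4;
DUMP only4;
```

With the relation c shown above, sp_other holds (1,2,3) and (7,2,5), and only4 holds the same two tuples as sp1.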
grunt> chars = LOAD 'char.txt' AS (c:chararray);
grunt> chargrp = GROUP chars by c;
grunt> dump chargrp;
(a,{(a),(a),(a)})
(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})
grunt> describe chargrp;
chargrp: {group: chararray,chars: {(c: chararray)}}
FOREACH with Functions
– Pig comes with many built-in functions, including COUNT, FLATTEN,
CONCAT, etc.
– You can also implement a custom function (UDF).
COUNT: