Pig
PIG ARCHITECTURE
=================================================================
Execution modes :
1) Local - used for files on the local file system; processing is done on the local
machine.
2) MapReduce - used for HDFS, i.e. the shared file system; processing is done in a
distributed manner, i.e. on multiple machines/nodes.
3) HCatalog - start the shell with pig -useHCatalog (launch commands for all modes
are sketched below).
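For reference, a quick sketch of launching each mode from the shell :
$ pig -x local          # local mode
$ pig -x mapreduce      # MapReduce mode (the default)
$ pig -useHCatalog      # start with HCatalog support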
=================================================================
DATA INGESTION (SCHEMALESS LOADING)
=================================================================
To enter the Pig (grunt) shell in local mode :
pig -x local
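A minimal schemaless load sketch (the path and relation name EMP are assumptions);
with no schema declared, fields are referenced positionally as $0, $1, ... :
EMP = LOAD '/home/itelligence/emp.txt' USING PigStorage(',');
A = FOREACH EMP GENERATE $0, $1;   -- positional field references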
DUMP EMP;
SQL-to-Pig analogies : SELECT corresponds to GENERATE, FROM corresponds to FOREACH.
=================================================================
DATA INGESTION (SCHEMA BASED LOADING)
=================================================================
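With schema-based loading, fields get names and types at LOAD time; a minimal
sketch (path and field names are assumptions) :
EMP = LOAD '/home/itelligence/emp.txt' USING PigStorage(',')
      AS (id:int, name:chararray, salary:double, dept:chararray);
DESCRIBE EMP;   -- prints the declared schema
DUMP EMP;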
=================================================================
FILTERING
=================================================================
FILTER supports boolean operators (AND, OR, NOT); a sketch follows below.
Filtered relations can be recombined with UNION :
E = UNION B1,C1,D1;
DUMP E;
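A minimal filtering sketch (path, schema, and conditions are assumptions) :
A  = LOAD '/home/itelligence/emp.txt' USING PigStorage(',')
     AS (id:int, name:chararray, salary:double);
-- OR keeps a row when either condition holds
B1 = FILTER A BY salary > 50000 OR id < 100;
DUMP B1;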
=================================================================
LOAD DATA WITH MULTIPLE DELIMITER
=================================================================
When multiple delimiters are present in a file (e.g. both ',' and ':'), first load
the file without PigStorage, so each record becomes a single field, and then split
it into separate fields using the REGEX_EXTRACT_ALL function.
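A sketch assuming each record looks like id,name:city (the path and regex are
illustrative) :
A = LOAD '/home/itelligence/multi_delim.txt';   -- whole line lands in $0
B = FOREACH A GENERATE FLATTEN(
        REGEX_EXTRACT_ALL((chararray)$0, '(.*),(.*):(.*)'))
    AS (id:chararray, name:chararray, city:chararray);
C = FOREACH B GENERATE id, city;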
DUMP B;
DUMP C;
==============================================================
SCRIPT TO FILTER ONLY ERROR CODE AND ERROR MESSAGE FROM THE LOG FILE :
LOG = LOAD '/home/itelligence/pig_1491722865605.log';
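The step that produces B is missing; a sketch assuming log lines of the form
'ERROR 1066: Unable to open iterator' :
ERR = FILTER LOG BY (chararray)$0 MATCHES '.*ERROR.*';
B   = FOREACH ERR GENERATE FLATTEN(
          REGEX_EXTRACT_ALL((chararray)$0, '.*ERROR\\s+(\\d+):\\s*(.*)'))
      AS (error_code:chararray, error_message:chararray);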
DUMP B;
=================================================================
JOINS
=================================================================
FOR LEFT OUTER AND RIGHT OUTER JOINS YOU NEED TO SPECIFY THE SCHEMA WHILE LOADING
THE FILES: PIG NEEDS THE SCHEMA OF THE SIDE THAT PRODUCES NULLS FOR NON-MATCHING
KEYS, I.E. THE RIGHT FILE FOR A LEFT OUTER JOIN, AND VICE VERSA.
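A minimal outer-join sketch (paths and fields are assumptions) :
EMP  = LOAD '/home/itelligence/emp.txt'  USING PigStorage(',')
       AS (id:int, name:chararray, dept_id:int);
DEPT = LOAD '/home/itelligence/dept.txt' USING PigStorage(',')
       AS (dept_id:int, dept_name:chararray);
-- keep all EMP rows; DEPT columns are null where no match exists
J = JOIN EMP BY dept_id LEFT OUTER, DEPT BY dept_id;
DUMP J;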
REPLICATED JOIN :
WHEN JOINING TWO TABLES, THE SMALLER TABLE IS COPIED (REPLICATED) INTO MEMORY ON
EVERY NODE THAT HOLDS BLOCKS OF THE LARGER TABLE, SO THE JOIN BECOMES A MAP-SIDE
LOOKUP. USED FOR PERFORMANCE OPTIMIZATION; THE SMALLER RELATION MUST FIT IN MEMORY.
SKEWED JOIN :
USED WHEN A FEW JOIN KEYS CARRY MOST OF THE ROWS (DATA SKEW); PIG SAMPLES THE INPUT
AND SPREADS THE SKEWED KEYS ACROSS REDUCERS.
MERGE JOIN :
USED WHEN BOTH INPUTS ARE ALREADY SORTED ON THE JOIN KEY; THE JOIN RUNS MAP-SIDE
WITHOUT A SORT PHASE.
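A sketch of the three syntaxes (relation and key names are assumptions; with
'replicated' the small relation is listed last) :
R = JOIN BIG BY id, SMALL BY id USING 'replicated';  -- small side held in memory
S = JOIN A BY id, B BY id USING 'skewed';            -- tolerates skewed keys
M = JOIN A BY id, B BY id USING 'merge';             -- both inputs pre-sorted on id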
=================================================================
SCHEMA ON READ
=================================================================
The schema is applied when the data is read (at LOAD time), not when it is stored,
so the same file can be loaded with different schemas.
=================================================================
COGROUP
=================================================================
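A minimal COGROUP sketch (paths and fields are assumptions); COGROUP groups two
relations by a key in one statement, producing one bag per input relation :
A = LOAD '/home/itelligence/emp.txt' USING PigStorage(',')
    AS (name:chararray, dept:chararray);
B = LOAD '/home/itelligence/loc.txt' USING PigStorage(',')
    AS (dept:chararray, location:chararray);
C = COGROUP A BY dept, B BY dept;
-- each output tuple : (group, {matching A rows}, {matching B rows})
DUMP C;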
=================================================================
UDF (USER DEFINED FUNCTION)
=================================================================
CREATE PROJECT ==> ADD LIBRARIES (HADOOP LIBRARIES, PIG LIBRARIES) ==> CREATE PACKAGE
==> CREATE CLASS ==> PASTE PROGRAM ==> EXPORT TO JAR
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToUpper extends EvalFunc<String> {
    // Called once per input tuple; returns the first field upper-cased.
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        String str = input.get(0).toString();
        return str.toUpperCase();
    }
}
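A sketch of registering and invoking the UDF from the grunt shell (the jar path is
an assumption) :
REGISTER /home/itelligence/myudfs.jar;
A = LOAD '/home/itelligence/emp.txt' USING PigStorage(',') AS (name:chararray);
B = FOREACH A GENERATE myudfs.ToUpper(name);
DUMP B;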
=================================================================
PIGGYBANK
=================================================================
Contains predefined UDFs. First REGISTER the piggybank jar in Pig, and then any
UDF in piggybank can be used.
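A usage sketch (the jar location varies by installation, so the path is an
assumption; UPPER is one of piggybank's string UDFs) :
REGISTER /usr/lib/pig/piggybank.jar;
A = LOAD '/home/itelligence/emp.txt' USING PigStorage(',') AS (name:chararray);
B = FOREACH A GENERATE org.apache.pig.piggybank.evaluation.string.UPPER(name);
DUMP B;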
=================================================================
XML FILE LOADING TO PIG
=================================================================
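The load that produces A is missing; a sketch using piggybank's XMLLoader (the
file path is an assumption) :
REGISTER /usr/lib/pig/piggybank.jar;
A = LOAD '/home/itelligence/core-site.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('property') AS (x:chararray);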
B = FOREACH A GENERATE
    FLATTEN(REGEX_EXTRACT_ALL($0,
        '<property>\\s*<name>(.*)</name>\\s*<value>(.*)</value>\\s*</property>'));
(writing Pig commands in a file and then executing the file directly, either from
the hadoop ($) prompt or from the grunt prompt)
from the $ prompt :
pig -x local /home/itelligence/Desktop/script_ns.pig
from the grunt prompt :
exec /home/itelligence/Desktop/script_ns.pig
=================================================================
WORD COUNT IN FLAT FILE
=================================================================
(fragment; the complete script is in the WORDCOUNT section below)
C = GROUP B BY $0;
=======================================================
START-UP
DESCRIBE employees;   -- prints the schema of the employees relation
==============================================================
GROUP WITH NESTED FOREACH (DISTINCT LOCATIONS PER PRODUCT)
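The load that produces A is missing; a sketch (path and schema are assumptions) :
A = LOAD '/home/itelligence/sales.txt' USING PigStorage(',')
    AS (product:chararray, location:chararray, amount:double);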
B = GROUP A BY product;
C = FOREACH B {
        -- nested block runs once per product group
        LOCS = DISTINCT A.location;
        GENERATE group, COUNT(LOCS) AS location_count;
};
DUMP C;
==============================================================
WORDCOUNT
A = LOAD '------/input.txt';
-- split each line into words; FLATTEN turns the bag of tokens into rows
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE COUNT(B), group;
STORE D INTO '------/wordcount';
==============================================================
-- Problem Stmt : find the number of items bought by each customer and
-- which item he/she bought the most times.
-- load the input data :: Schema ( customerId, itemId, orderDate, deliveryDate );
-- group by customerId
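A sketch of one possible solution (the path mirrors the elided style above; field
types are assumptions) :
A = LOAD '------/orders.txt' USING PigStorage(',')
    AS (customerId:chararray, itemId:chararray, orderDate:chararray,
        deliveryDate:chararray);
-- count how many times each customer bought each item
B = GROUP A BY (customerId, itemId);
C = FOREACH B GENERATE FLATTEN(group) AS (customerId, itemId), COUNT(A) AS cnt;
-- per customer : total items bought and the single most-bought item
D = GROUP C BY customerId;
E = FOREACH D {
        SORTED = ORDER C BY cnt DESC;
        TOP1 = LIMIT SORTED 1;
        GENERATE group AS customerId, SUM(C.cnt) AS total_items,
                 FLATTEN(TOP1.itemId) AS top_item;
};
DUMP E;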
==============================================================