Apache Pig: Senthil Kumar A
Apache Pig: Senthil Kumar A
com 1
APACHE PIG
Senthil Kumar A
[email protected] 2
Introduction
• Abstraction over Mapreduce.
• It is a data-flow language called Pig Latin.
• Pig was originally created at Yahoo! To serve the similar need to
hive.
• Many developers doesn't have the knowledge of
Java/Mapreduce
• Under the covers, PigLatin scripts are turned as a Mapreduce
jobs and runs on the hadoop cluster
• Latest release is 0.12.0
[email protected] 3
Pig Features
• Joining the dataset
• Sorting and aggregation
• Grouping data
• Referring to elements by position(useful for large datasets)
• Creation of UDF using java
[email protected] 4
Installation
• tar –xvf pig-***.tgz
• Set JAVA_HOME
• Set HADOOP_HOME
[email protected] 5
Accessing Pig
• Interactive mode
• Grunt, the Pig shell
• Batch mode
• Submitting a Pig script directly
• Pig server
• Java class, JDBC like interface
[email protected] 6
Data Types
• Scalar Types
• int 10
• float 10.0F
• long 10L
• double 10.0
• chararray hello
• bytearray
[email protected] 9
Data formats
• PigStorage
• using field delimited text format
• BinStorage
• Loads/stores relations in HDFS from or to binary files
• TextLoader
• Loads relations in HDFS from a plain text format
• Loads a whole line as single column
• PigDump
• Stores relations in HDFS by writing the toString() representation of
tuples, one per line
[email protected] 10
• Describe F;
• Describe A;
• Illustrate F;
• Illustrate A;
[email protected] 12
Execution Plan
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• Explain F;
[email protected] 13
Eliminating duplicates
• Select distinct drug from patient;
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• D = foreach A generate drug;
• unique = DISTINCT D;
• Dump unique;
[email protected] 15
Cont..
• -- Not matches Brandon
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by not pname matches 'Brandon.*';
• dump F;
• -- Count
• A =load '/user/senthil/drugdata' using PigStorage(',');
• F = GROUP A ALL;
• sm = foreach F generate COUNT_STAR(A);
• dump sm;
[email protected] 17
Macros in Pig
• DEFINE my_macro(V, col,value) returns B {
$B = FILTER $V BY $col == '$value';
};
• A = load ‘/datagen_10.txt' using PigStorage(',');
• C = my_macro(A,$2,'metacin');
• dump C;
[email protected] 18
Outer joins
• Pig can perform left, right, full outer joins(similar to sql)
GROUP vs COGROUP
• GROUP – collects records of one input based on a key
• COGROUP – collects records of n inputs based on a key
• C = COGROUP A by $2, B by $0;
• Dump C;
[email protected] 21
Pig Scripts
• Use Pig scripts to place Pig Latin statements and Pig commands
in a single file.
• Good practice to identify the file using *.Pig
• Can run scripts that are stored in HDFS
• Pig hdfs://path/script.pig
• Single as well as Comment lines can be added
[email protected] 22
Pig Server
• It is not a daemon server
• It is a single threaded stub to run pig in a java application
• org.apache.pig.Pigserver class
• Allows java programs to invoke pig commands
• Use “local” or “mapreduce” to indicate run method
• PigServer
• ps = new PigSrever(“local”);
• ps.registerQuery(“A = load 'file' ”);
• ps.registerQuery(“B = group A by $0 ”);
• ps.store(“B”, “outfile”);
[email protected] 23
THANK YOU