
Apache Pig: Senthil Kumar A

This document provides an overview of Apache Pig, including:
• Pig is a data flow language called Pig Latin that allows abstraction over MapReduce jobs.
• Pig was created at Yahoo to allow developers without Java/MapReduce knowledge to analyze large datasets.
• Features include joining, sorting, grouping, and user defined functions using Java.
• Pig scripts can be run interactively using Grunt or submitted in batch mode.
• Common tasks like loading, filtering, grouping, joining, and storing data are demonstrated using Pig Latin statements.

Uploaded by

Babjee Reddy

training@datadotz.com

APACHE PIG
Senthil Kumar A

Introduction
• An abstraction over MapReduce.
• Its data-flow language is called Pig Latin.
• Pig was originally created at Yahoo! to serve a need similar to the one Hive addresses.
• Many developers do not know Java or MapReduce.
• Under the covers, Pig Latin scripts are turned into MapReduce jobs that run on the Hadoop cluster.
• The latest release at the time of writing is 0.12.0.

Pig Features
• Joining the dataset
• Sorting and aggregation
• Grouping data
• Referring to elements by position (useful for large datasets)
• Creating UDFs in Java

Installation
• tar -xvf pig-***.tgz
• Set JAVA_HOME
• Set HADOOP_HOME

Accessing Pig
• Interactive mode
• Grunt, the Pig shell

• Batch mode
• Submitting a Pig script directly

• Pig server
• Java class with a JDBC-like interface

Grunt: The Pig Shell (bin/pig)


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• dump F;

Naming fields and assigning data types


• A = load '/user/senthil/drugdata' using PigStorage(',') as
(pid:int, pname:chararray, drug:chararray,
gender:chararray, tot_amt:int);

• F = filter A by drug == 'avil';


• dump F;

Data Types
• Scalar Types
• int 10
• float 10.0F
• long 10L
• double 10.0
• chararray 'hello'
• bytearray
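
The scalar types above can be declared in a load schema and written as constants in expressions. A minimal sketch, assuming a hypothetical comma-delimited file '/user/senthil/typed_sample' whose six columns match the declared types (the path and field names are illustrative and not from the original slides):

```pig
-- Hypothetical input path and field names, for illustration only.
T = load '/user/senthil/typed_sample' using PigStorage(',') as
    (i:int, f:float, l:long, d:double, s:chararray, b:bytearray);

-- Constants use the notation listed above: 10, 10.0F, 10L, 10.0, 'hello'.
C = foreach T generate i + 10, f * 10.0F, l + 10L, d * 10.0, CONCAT(s, 'hello'), b;

describe T;
dump C;
```

A field loaded without a declared type is treated as bytearray, which is why that type has no constant form in the list.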

Data formats
• PigStorage
• Loads/stores relations using a field-delimited text format
• BinStorage
• Loads/stores relations in HDFS from or to binary files
• TextLoader
• Loads relations in HDFS from a plain text format
• Loads a whole line as single column
• PigDump
• Stores relations in HDFS by writing the toString() representation of
tuples, one per line
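
The load/store functions above might be combined as in the following sketch, intended for the Grunt shell. It reuses the drug data path from the other slides; '/pig_bin_tmp' is a hypothetical output directory chosen for illustration.

```pig
-- TextLoader: each input line arrives as a single chararray field.
L = load '/user/senthil/drugdata' using TextLoader() as (line:chararray);

-- Split the line on commas to recover the individual fields.
P = foreach L generate FLATTEN(STRSPLIT(line, ','));

-- BinStorage: write an intermediate result in binary form and read it back.
-- '/pig_bin_tmp' is a hypothetical directory, not taken from the slides.
store P into '/pig_bin_tmp' using BinStorage();
B = load '/pig_bin_tmp' using BinStorage();
dump B;
```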

Store the results


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• store F into '/pig_result001' using PigStorage(',');

Store -> writes the data to an HDFS directory



Viewing the Schema


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';

• Describe F;
• Describe A;

• Illustrate F;
• Illustrate A;
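
What describe reports depends on whether a schema was declared at load time. A small sketch of the contrast, reusing the schema from the "data types" slide (the alias S is introduced here only for illustration):

```pig
-- No schema declared: describe reports that the schema of A is unknown.
A = load '/user/senthil/drugdata' using PigStorage(',');
describe A;

-- Schema declared: describe lists each field name with its type.
S = load '/user/senthil/drugdata' using PigStorage(',') as
    (pid:int, pname:chararray, drug:chararray, gender:chararray, tot_amt:int);
F = filter S by drug == 'avil';
describe F;

-- illustrate runs the statements on a small sample and shows rows at each step.
illustrate F;
```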

Execution Plan
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';

• Explain F;

Grouping and Sorting


• A = load '/user/senthil/drugdata' using PigStorage(',');
• D = GROUP A by $2;
• sm = foreach D generate group,SUM(A.$4) as s;
• smorder = order sm by s desc;
• dump smorder;
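
The same grouping reads more clearly with named fields instead of positions. A sketch, assuming the drug data has the five-column schema shown on the "data types" slide; the alias names drug and total in the output are illustrative:

```pig
A = load '/user/senthil/drugdata' using PigStorage(',') as
    (pid:int, pname:chararray, drug:chararray, gender:chararray, tot_amt:int);

-- Each tuple of D is (group, {bag of A tuples sharing that drug}).
D = group A by drug;

-- SUM adds tot_amt across the bag belonging to each group.
sm = foreach D generate group as drug, SUM(A.tot_amt) as total;

-- Largest totals first.
smorder = order sm by total desc;
dump smorder;
```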

Eliminating duplicates
• SQL equivalent: select distinct drug from patient;
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• D = foreach A generate drug;

• unique = DISTINCT D;
• Dump unique;

Limit, Match, Non-Match and Count


• -- LIMIT
• Reduces the number of output records
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = limit A 2;
• dump F;

• -- MATCHES: similar to LIKE in SQL


• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by pname matches 'Brandon.*';
• dump F;

Limit, Match, Non-Match and Count (cont.)
• -- Not matches Brandon
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by not pname matches 'Brandon.*';
• dump F;

• -- Count
• A = load '/user/senthil/drugdata' using PigStorage(',');
• F = GROUP A ALL;
• sm = foreach F generate COUNT_STAR(A);
• dump sm;

Macros in Pig
• DEFINE my_macro(V, col,value) returns B {
$B = FILTER $V BY $col == '$value';
};
• A = load '/datagen_10.txt' using PigStorage(',');
• C = my_macro(A,$2,'metacin');
• dump C;

Joining Data Sets


• PigLatin supports inner and outer joins of two or more relations.

Inner join: joins two relations on a common key

• A = load '/datagen_10.txt' using PigStorage(',');
• B = load '/drug.txt' using PigStorage();
• C = join A by $2, B by $0;
• dump C;

Outer joins
• Pig can perform left, right, and full outer joins (similar to SQL).

• A = load '/datagen_10.txt' using PigStorage(',');


• B = load '/drug.txt' using PigStorage();
• C = join A by $2 [left outer|right outer|full outer], B by $0;
• Dump C;
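
The bracketed notation above stands for a choice of one keyword. A concrete left outer join might look like the sketch below. Pig generally needs a schema on the relation that is padded with nulls, so both loads declare one here. The schema for '/datagen_10.txt' is assumed to match the drug data used on earlier slides, and the two field names for '/drug.txt' are hypothetical, since its columns are not given.

```pig
-- Assumed schema: same layout as the drug data on earlier slides.
A = load '/datagen_10.txt' using PigStorage(',') as
    (pid:int, pname:chararray, drug:chararray, gender:chararray, tot_amt:int);

-- Hypothetical field names; the real columns of '/drug.txt' are not shown.
B = load '/drug.txt' using PigStorage() as (dname:chararray, info:chararray);

-- Keep every row of A; B's fields are null where no drug name matches.
C = join A by drug left outer, B by dname;
dump C;
```

Swapping `left outer` for `right outer` or `full outer` keeps the unmatched rows of B, or of both relations, instead.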

GROUP vs COGROUP
• GROUP – collects records of one input based on a key
• COGROUP – collects records of n inputs based on a key
• C = COGROUP A by $2, B by $0;
• Dump C;
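
The difference between the two operators shows up in the shape of the result tuples. A short sketch, reusing the two input files from the join slides (the aliases G and N are introduced here for illustration):

```pig
A = load '/datagen_10.txt' using PigStorage(',');
B = load '/drug.txt' using PigStorage();

-- GROUP: one input. Each result tuple is (group, {matching A tuples}).
G = group A by $2;

-- COGROUP: several inputs. Each result tuple is
-- (group, {matching A tuples}, {matching B tuples}); a bag may be empty.
C = cogroup A by $2, B by $0;

-- Count the tuples each input contributed for every key.
N = foreach C generate group, COUNT(A), COUNT(B);
dump N;
```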

Pig Scripts
• Use Pig scripts to place Pig Latin statements and Pig commands in a single file.
• It is good practice to name the file with the .pig extension.
• Scripts stored in HDFS can also be run:
• pig hdfs://path/script.pig
• Single-line as well as multi-line comments can be added.
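
A script file combining these points could look like the following sketch. The file name drug_totals.pig is hypothetical; the paths and schema are the ones used on the earlier slides.

```pig
/* drug_totals.pig (hypothetical file name)
   Multi-line comments use C-style delimiters. */

-- Single-line comments start with two hyphens.
A = load '/user/senthil/drugdata' using PigStorage(',') as
    (pid:int, pname:chararray, drug:chararray, gender:chararray, tot_amt:int);
D = group A by drug;
sm = foreach D generate group, SUM(A.tot_amt) as s;

-- In batch mode, store writes the result to HDFS instead of printing it.
store sm into '/pig_result001' using PigStorage(',');
```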

Pig Server
• It is not a daemon server
• It is a single-threaded stub for running Pig inside a Java application
• org.apache.pig.PigServer class
• Allows Java programs to invoke Pig commands
• Use "local" or "mapreduce" to indicate the execution mode
• PigServer
• PigServer ps = new PigServer("local");
• ps.registerQuery("A = load 'file';");
• ps.registerQuery("B = group A by $0;");
• ps.store("B", "outfile");

Implementation of UPPER UDF


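package com;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that returns its first argument in upper case.
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // No input: return null so Pig treats the result as null.
        if (input == null || input.size() == 0) {
            return null;
        }
        // Errors from input.get() propagate as IOException instead of being swallowed.
        String str = (String) input.get(0);
        return str == null ? null : str.toUpperCase();
    }
}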
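
Once compiled and packaged, the UDF can be called from Pig Latin. A minimal sketch, assuming the class has been packaged into a jar named myudfs.jar (a hypothetical name) located in the directory Pig is started from:

```pig
-- 'myudfs.jar' is a hypothetical jar containing the compiled com.Upper class.
register myudfs.jar;

A = load '/user/senthil/drugdata' using PigStorage(',') as
    (pid:int, pname:chararray, drug:chararray, gender:chararray, tot_amt:int);

-- Call the UDF by its fully qualified class name.
U = foreach A generate pid, com.Upper(pname);
dump U;
```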

THANK YOU
