Apache Pig: Senthil Kumar A

This document provides an overview of Apache Pig, including: - Pig is a data flow language called Pig Latin that allows abstraction over MapReduce jobs. - Pig was created at Yahoo to allow developers without Java/MapReduce knowledge to analyze large datasets. - Features include joining, sorting, grouping, and user defined functions using Java. - Pig scripts can be run interactively using Grunt or submitted in batch mode. - Common tasks like loading, filtering, grouping, joining, and storing data are demonstrated using Pig Latin statements.

Uploaded by

Babjee Reddy

training@datadotz.com

APACHE PIG
Senthil Kumar A

Introduction
• Pig is an abstraction over MapReduce.
• Its data-flow language is called Pig Latin.
• Pig was originally created at Yahoo! to serve a need similar to Hive's.
• Many developers do not know Java or MapReduce.
• Under the covers, Pig Latin scripts are turned into MapReduce jobs that run on the Hadoop cluster.
• Latest release (at the time of writing) is 0.12.0

Pig Features
• Joining the dataset
• Sorting and aggregation
• Grouping data
• Referring to elements by position (useful for large datasets)
• Creating user-defined functions (UDFs) in Java

Installation
• Extract the archive: tar -xvf pig-***.tgz
• Set JAVA_HOME to the Java installation directory
• Set HADOOP_HOME to the Hadoop installation directory

Accessing Pig
• Interactive mode
  • Grunt, the Pig shell
• Batch mode
  • Submitting a Pig script directly
• PigServer
  • A Java class with a JDBC-like interface

Grunt: The Pig Shell (bin/pig)


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• dump F;
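
As a rough illustration of what the filter above computes, the sketch below models it in plain Java on an in-memory list. It is not Pig API code: the class name FilterSketch, the method name and the sample rows are all hypothetical, and Pig itself would run the statement as a MapReduce job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of: F = filter A by $2 == 'avil';
// Hypothetical helper, only mirrors the semantics on a small list.
final class FilterSketch {

    // Keeps the lines whose field at position pos (0-based, like $0, $1, ...) equals value.
    static List<String> filterByPosition(List<String> lines, int pos, String value) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",", -1); // PigStorage(',') splits on commas
            if (pos < fields.length && fields[pos].equals(value)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical rows in the order pid,pname,drug,gender,tot_amt
        List<String> a = Arrays.asList(
                "1,Name One,avil,female,525",
                "2,Name Two,metacin,male,350",
                "3,Name Three,avil,male,100");
        for (String row : filterByPosition(a, 2, "avil")) {
            System.out.println(row);
        }
    }
}
```

The position argument plays the role of $2 in the Pig statement, so the same helper covers any column.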

Naming the fields and declaring their data types


• A = load '/user/senthil/drugdata' using PigStorage(',') as
(pid:int, pname:chararray, drug:chararray,
gender:chararray, tot_amt:int);

• F = filter A by drug == 'avil';


• dump F;

Data Types
• Scalar types (with example values)
  • int: 10
  • float: 10.0F
  • long: 10L
  • double: 10.0
  • chararray: 'hello'
  • bytearray (the default type when no type is declared)

Data formats
• PigStorage
  • Loads/stores relations as field-delimited text (tab is the default delimiter)
• BinStorage
  • Loads/stores relations in HDFS from or to binary files
• TextLoader
  • Loads relations in HDFS from plain text
  • Loads each whole line as a single column
• PigDump
  • Stores relations in HDFS by writing the toString() representation of tuples, one per line

Store the results


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• store F into '/pig_result001' using PigStorage(',');

STORE writes the data to an HDFS directory.



Viewing the Schema


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';

• Describe F;
• Describe A;

• Illustrate F;
• Illustrate A;

Execution Plan
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';

• Explain F;

Grouping and Sorting


• A = load '/user/senthil/drugdata' using PigStorage(',');
• D = GROUP A by $2;
• sm = foreach D generate group, SUM(A.$4) as s;
• smorder = order sm by s desc;
• dump smorder;
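
The group, sum and order steps above can be pictured with the plain-Java sketch below. It is only a model of the result, not of how Pig executes it; the class name GroupSumSketch and the sample rows are hypothetical, and amounts are parsed as whole numbers for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of: GROUP A by $2; SUM(A.$4); ORDER ... BY s DESC.
final class GroupSumSketch {

    // Returns (key, total) entries sorted by total, largest first.
    static List<Map.Entry<String, Long>> totalsByDrug(List<String> lines) {
        Map<String, Long> totals = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(",", -1);
            // $2 is the grouping key, $4 is the amount being summed.
            totals.merge(f[2], Long.parseLong(f[4]), Long::sum);
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort((x, y) -> Long.compare(y.getValue(), x.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical rows in the order pid,pname,drug,gender,tot_amt
        List<String> a = Arrays.asList(
                "1,Name One,avil,female,100",
                "2,Name Two,metacin,male,350",
                "3,Name Three,avil,male,50");
        for (Map.Entry<String, Long> e : totalsByDrug(a)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```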

Eliminating duplicates
• SQL equivalent: SELECT DISTINCT drug FROM patient;
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• D = foreach A generate drug;

• unique = DISTINCT D;
• Dump unique;
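
A minimal plain-Java sketch of the same idea: project one column, then drop repeated values. The class name DistinctSketch and the sample rows are hypothetical. The sketch keeps first-seen order, which Pig does not promise for DISTINCT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of: D = foreach A generate drug; unique = DISTINCT D;
final class DistinctSketch {

    // Returns each value of the field at position pos once.
    static List<String> distinctField(List<String> lines, int pos) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            seen.add(line.split(",", -1)[pos]); // projection, like "generate drug"
        }
        return new ArrayList<>(seen); // the set has already removed duplicates
    }

    public static void main(String[] args) {
        // Hypothetical rows in the order pid,pname,drug,gender,tot_amt
        List<String> a = Arrays.asList(
                "1,Name One,avil,female,100",
                "2,Name Two,metacin,male,350",
                "3,Name Three,avil,male,50");
        System.out.println(distinctField(a, 2));
    }
}
```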

Limit, Match, Non-Match and Count


• -- LIMIT: reduces the number of output records
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = limit A 2;
• dump F;

• -- MATCHES: similar to LIKE in SQL, but the pattern is a Java regular expression


• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by pname matches 'Brandon.*';
• dump F;
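
Because the pattern is a Java regular expression that has to match the whole value, the behaviour of the filter above can be checked with plain Java. The sketch below is an assumption-labeled model (the class name MatchesSketch and the sample names are hypothetical); it treats a null value as a non-match, which is how a filter would drop it.

```java
import java.util.regex.Pattern;

// Plain-Java model of: filter A by pname matches 'Brandon.*';
final class MatchesSketch {

    // True only when the regex matches the entire value, not just a part of it.
    static boolean matches(String value, String regex) {
        return value != null && Pattern.matches(regex, value);
    }

    public static void main(String[] args) {
        // Hypothetical names
        System.out.println(matches("Brandon Smith", "Brandon.*")); // whole value matches
        System.out.println(matches("Brandon Smith", "Brandon"));   // pattern covers only a prefix
    }
}
```

This is why the slide writes 'Brandon.*' rather than 'Brandon': the trailing .* lets the pattern cover the rest of the name.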

Limit, Match, Non-Match and Count (cont.)
• -- NOT MATCHES: names that do not start with Brandon
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by not pname matches 'Brandon.*';
• dump F;

• -- Count
• A = load '/user/senthil/drugdata' using PigStorage(',');
• F = GROUP A ALL;
• sm = foreach F generate COUNT_STAR(A);
• dump sm;

Macros in Pig
• DEFINE my_macro(V, col,value) returns B {
$B = FILTER $V BY $col == '$value';
};
• A = load '/datagen_10.txt' using PigStorage(',');
• C = my_macro(A,$2,'metacin');
• dump C;

Joining Data Sets


• Pig Latin supports inner and outer joins of two or more relations.

Inner join: joins two relations on a common key

• A = load '/datagen_10.txt' using PigStorage(',');
• B = load '/drug.txt' using PigStorage();
• C = join A by $2, B by $0;
• dump C;
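
The inner join above can be sketched in plain Java as a hash join: index one input by its key, then look up each row of the other. This is only a model of the result, not Pig's implementation; the class name InnerJoinSketch and all sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of: C = join A by $2, B by $0;
final class InnerJoinSketch {

    // Each output row is the fields of the A row followed by the fields of the matching B row.
    static List<String> join(List<String[]> a, int aKey, List<String[]> b, int bKey) {
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] row : b) {
            index.computeIfAbsent(row[bKey], k -> new ArrayList<>()).add(row);
        }
        List<String> out = new ArrayList<>();
        for (String[] left : a) {
            List<String[]> partners = index.getOrDefault(left[aKey], Collections.emptyList());
            for (String[] right : partners) {
                out.add(String.join(",", left) + "," + String.join(",", right));
            }
        }
        return out; // rows of A without a partner in B are dropped
    }

    public static void main(String[] args) {
        // Hypothetical rows; A is keyed on field 2, B on field 0
        List<String[]> a = Arrays.asList(
                new String[] {"1", "Name One", "avil"},
                new String[] {"2", "Name Two", "unknown"});
        List<String[]> b = Arrays.asList(
                new String[] {"avil", "info-avil"},
                new String[] {"metacin", "info-metacin"});
        for (String row : join(a, 2, b, 0)) {
            System.out.println(row);
        }
    }
}
```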

Outer joins
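• Pig can perform left, right, and full outer joins (similar to SQL).
• Note: Pig needs a schema on the side that gets padded with nulls (B for a left outer join); add an as (...) clause to that load.

• A = load '/datagen_10.txt' using PigStorage(',');
• B = load '/drug.txt' using PigStorage();
• C = join A by $2 left outer, B by $0;  -- or: right outer, full outer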
• Dump C;

GROUP vs COGROUP
• GROUP – collects records of one input based on a key
• COGROUP – collects records of n inputs based on a key
• C = COGROUP A by $2, B by $0;
• Dump C;
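
To show how COGROUP differs from a join, the plain-Java sketch below builds one record per key with two bags, one per input. A key that appears in only one input still gets a record, with an empty bag for the other. The class name CogroupSketch and the sample rows are hypothetical, and this models the output shape only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of: C = COGROUP A by $2, B by $0;
final class CogroupSketch {

    // Result: key -> [rows of a with that key, rows of b with that key].
    static Map<String, List<List<String>>> cogroup(List<String[]> a, int aKey,
                                                   List<String[]> b, int bKey) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        for (String[] row : a) {
            bagsFor(out, row[aKey]).get(0).add(String.join(",", row));
        }
        for (String[] row : b) {
            bagsFor(out, row[bKey]).get(1).add(String.join(",", row));
        }
        return out;
    }

    // Creates the pair of empty bags the first time a key is seen.
    private static List<List<String>> bagsFor(Map<String, List<List<String>>> out, String key) {
        return out.computeIfAbsent(key, k -> {
            List<List<String>> bags = new ArrayList<>();
            bags.add(new ArrayList<>());
            bags.add(new ArrayList<>());
            return bags;
        });
    }

    public static void main(String[] args) {
        // Hypothetical rows; A is keyed on field 2, B on field 0
        List<String[]> a = Arrays.asList(
                new String[] {"1", "Name One", "avil"},
                new String[] {"2", "Name Two", "avil"});
        List<String[]> b = Arrays.asList(
                new String[] {"avil", "info-avil"},
                new String[] {"metacin", "info-metacin"});
        System.out.println(cogroup(a, 2, b, 0));
    }
}
```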

Pig Scripts
• Use Pig scripts to place Pig Latin statements and Pig commands in a single file.
• It is good practice to name script files with the .pig extension.
• Scripts stored in HDFS can be run directly:
  • pig hdfs://path/script.pig
• Both single-line (--) and multi-line (/* */) comments can be added.

Pig Server
• It is not a daemon server.
• It is a single-threaded stub for running Pig inside a Java application.
• Class: org.apache.pig.PigServer
• Allows Java programs to invoke Pig commands.
• Pass "local" or "mapreduce" to the constructor to choose the execution mode.
• PigServer ps = new PigServer("local");
• ps.registerQuery("A = load 'file';");
• ps.registerQuery("B = group A by $0;");
• ps.store("B", "outfile");

Implementation of UPPER UDF


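package com;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Pig eval UDF that upper-cases its first argument.
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for an empty tuple or a null field.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // input.get(0) can throw an IOException subtype; let it propagate
        // instead of swallowing it.
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}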

THANK YOU
