Apache Pig: Senthil Kumar A

This document provides an overview of Apache Pig:
• Pig Latin, Pig's data-flow language, provides an abstraction over MapReduce jobs.
• Pig was created at Yahoo! so that developers without Java or MapReduce knowledge could analyze large datasets.
• Features include joining, sorting, grouping, and user-defined functions written in Java.
• Pig scripts can be run interactively in Grunt or submitted in batch mode.
• Common tasks such as loading, filtering, grouping, joining, and storing data are demonstrated with Pig Latin statements.

training@datadotz.com

APACHE PIG
Senthil Kumar A

Introduction
• Pig is an abstraction over MapReduce.
• Its data-flow language is called Pig Latin.
• Pig was originally created at Yahoo! to serve a need similar to Hive's.
• Many developers do not know Java or MapReduce.
• Under the covers, Pig Latin scripts are turned into MapReduce jobs that run on the Hadoop cluster.
• The latest release at the time of writing is 0.12.0.

Pig Features
• Joining datasets
• Sorting and aggregation
• Grouping data
• Referring to elements by position (useful for large datasets)
• Creating user-defined functions (UDFs) in Java

Installation
• tar -xvf pig-***.tgz
• Set JAVA_HOME
• Set HADOOP_HOME

Accessing Pig
• Interactive mode: Grunt, the Pig shell
• Batch mode: submitting a Pig script directly
• Pig server: a Java class with a JDBC-like interface

Grunt- The Pig Shell (bin/pig)

• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• dump F;

Naming fields and assigning data types

• A = load '/user/senthil/drugdata' using PigStorage(',') as
(pid:int, pname:chararray, drug:chararray,
gender:chararray, tot_amt:int);

• F = filter A by drug == 'avil';

• dump F;

Data Types
• Scalar types, with example literals:
• int: 10
• float: 10.0F
• long: 10L
• double: 10.0
• chararray: hello
• bytearray: a blob of bytes; the default type when no schema is declared

Data formats
• PigStorage: loads and stores relations as field-delimited text
• BinStorage: loads and stores relations in HDFS as binary files
• TextLoader: loads relations from plain text in HDFS, reading each whole line as a single column
• PigDump: stores relations in HDFS by writing the toString() representation of tuples, one per line

Store the results

• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• Store F into '/pig_result001' using PigStorage(',') ;

STORE writes the data to an HDFS directory.


Viewing the Schema

• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';

• Describe F;
• Describe A;

• Illustrate F;
• Illustrate A;

Execution Plan
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';

• Explain F;

Grouping and Sorting

• A = load '/user/senthil/drugdata' using PigStorage(',');
• D = GROUP A by $2;
• sm = foreach D generate group,SUM(A.$4) as s;
• smorder = order sm by s desc;
• dump smorder;
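
To make the data flow of the GROUP, SUM and ORDER statements above concrete, here is a minimal plain-Java sketch of the same computation on in-memory rows. It does not use Pig. The class name GroupSumSketch, the method totalsByDrug, and the sample rows are hypothetical, and the column positions ($2 = drug, $4 = amount) are assumed from the schema used elsewhere in these slides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration (not Pig code) of GROUP ... SUM ... ORDER BY ... DESC.
class GroupSumSketch {

    // Groups rows by column 2, sums column 4 per group, then sorts by the sum, largest first.
    static List<Map.Entry<String, Long>> totalsByDrug(List<String[]> rows) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (String[] row : rows) {
            // D = GROUP A by $2;  sm = foreach D generate group, SUM(A.$4) as s;
            sums.merge(row[2], Long.parseLong(row[4]), Long::sum);
        }
        List<Map.Entry<String, Long>> ordered = new ArrayList<>(sums.entrySet());
        // smorder = order sm by s desc;
        ordered.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: pid, pname, drug, gender, tot_amt
        List<String[]> rows = List.of(
            new String[] {"1", "Brandon", "avil", "M", "100"},
            new String[] {"2", "Asha", "metacin", "F", "250"},
            new String[] {"3", "Ravi", "avil", "M", "50"});
        for (Map.Entry<String, Long> e : totalsByDrug(rows)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In Pig the grouping and summing are spread across MapReduce tasks; the sketch only mirrors the logical result of the three statements.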

Eliminating duplicates
• SQL equivalent: SELECT DISTINCT drug FROM patient;
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• D = foreach A generate drug;

• unique = DISTINCT D;
• Dump unique;

Limit, Match, Non-Match and Count

• -- LIMIT: reduces the number of output records
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = limit A 2;
• dump F;

• -- MATCHES: similar to LIKE in SQL

• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by pname matches 'Brandon.*';
• dump F;
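
Pig's matches operator takes a Java regular expression and has to match the whole field, which is why the pattern above ends in .* rather than being just 'Brandon'. The plain-Java sketch below shows the same semantics using String.matches. It does not use Pig; the class name MatchesSketch, the helper filterMatches, and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration (not Pig code) of "filter A by pname matches '<regex>'".
class MatchesSketch {

    // Keeps only the names that the regex matches in full.
    static List<String> filterMatches(List<String> names, String regex) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            // String.matches succeeds only when the entire string matches the pattern.
            if (name.matches(regex)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample values for the pname field.
        List<String> names = List.of("Brandon", "Brandon Lee", "Mr Brandon", "Ravi");
        // Prefix match, like LIKE 'Brandon%' in SQL.
        System.out.println(filterMatches(names, "Brandon.*"));
        // Without the trailing .* only the exact value is kept.
        System.out.println(filterMatches(names, "Brandon"));
    }
}
```

A "contains" test, like LIKE '%Brandon%' in SQL, would need the pattern '.*Brandon.*'.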

Limit, Match, Non-Match and Count (cont.)
• -- Not matches Brandon
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by not pname matches 'Brandon.*';
• dump F;

• -- Count
• A = load '/user/senthil/drugdata' using PigStorage(',');
• F = GROUP A ALL;
• sm = foreach F generate COUNT_STAR(A);
• dump sm;

Macros in Pig
• DEFINE my_macro(V, col,value) returns B {
$B = FILTER $V BY $col == '$value';
};
• A = load '/datagen_10.txt' using PigStorage(',');
• C = my_macro(A,$2,'metacin');
• dump C;

Joining Data Sets

• Pig Latin supports inner and outer joins of two or more relations.

Inner join: joins two relations on a common key

• A = load '/datagen_10.txt' using PigStorage(',');
• B = load '/drug.txt' using PigStorage();
• C = join A by $2, B by $0;
• dump C;

Outer joins
• Pig can perform left, right, and full outer joins (similar to SQL).

• A = load '/datagen_10.txt' using PigStorage(',');

• B = load '/drug.txt' using PigStorage();
• -- Choose one of: left outer, right outer, full outer
• C = join A by $2 [left outer|right outer|full outer], B by $0;
• Dump C;
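
The difference between an inner and a left outer join can be sketched in plain Java on in-memory rows. This is only an illustration of the logical result, not Pig code and not how Pig executes joins. The class name JoinSketch, the method join, and the sample rows are hypothetical, and the sketch assumes the join keys are never null.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration (not Pig code) of inner vs. left outer join on A.$2 = B.$0.
class JoinSketch {

    // Nested-loop join; when leftOuter is set, unmatched left rows are padded with nulls.
    static List<List<String>> join(List<List<String>> a, List<List<String>> b, boolean leftOuter) {
        int rightWidth = b.isEmpty() ? 0 : b.get(0).size();
        List<List<String>> out = new ArrayList<>();
        for (List<String> left : a) {
            boolean matched = false;
            for (List<String> right : b) {
                // Join condition: column 2 of A equals column 0 of B.
                if (left.get(2).equals(right.get(0))) {
                    List<String> row = new ArrayList<>(left);
                    row.addAll(right);
                    out.add(row);
                    matched = true;
                }
            }
            if (!matched && leftOuter) {
                // No partner in B: keep the left row and fill B's columns with nulls.
                List<String> row = new ArrayList<>(left);
                for (int i = 0; i < rightWidth; i++) {
                    row.add(null);
                }
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical rows: A = (pid, pname, drug), B = (drug, maker)
        List<List<String>> a = List.of(
            List.of("1", "Brandon", "avil"),
            List.of("2", "Asha", "crocin"));
        List<List<String>> b = List.of(List.of("avil", "MakerX"));
        System.out.println("inner:      " + join(a, b, false));
        System.out.println("left outer: " + join(a, b, true));
    }
}
```

A right outer join is the mirror image (unmatched rows of B are kept), and a full outer join keeps the unmatched rows of both sides.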

GROUP vs COGROUP
• GROUP – collects records of one input based on a key
• COGROUP – collects records of n inputs based on a key
• C = COGROUP A by $2, B by $0;
• Dump C;
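
As a rough picture of what COGROUP produces, the plain-Java sketch below builds, for every key, one bag of matching rows from A and one from B. It does not use Pig. The class name CogroupSketch, its methods, and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration (not Pig code) of COGROUP A by $2, B by $0.
class CogroupSketch {

    // For each key, collects the matching rows of A (index 0) and of B (index 1).
    static Map<String, List<List<String[]>>> cogroup(List<String[]> a, int aKey,
                                                    List<String[]> b, int bKey) {
        Map<String, List<List<String[]>>> out = new TreeMap<>();
        for (String[] row : a) {
            bags(out, row[aKey]).get(0).add(row);
        }
        for (String[] row : b) {
            bags(out, row[bKey]).get(1).add(row);
        }
        return out;
    }

    // Returns the [bagOfA, bagOfB] pair for a key, creating two empty bags on first use.
    static List<List<String[]>> bags(Map<String, List<List<String[]>>> out, String key) {
        return out.computeIfAbsent(key, k -> {
            List<List<String[]>> pair = new ArrayList<>();
            pair.add(new ArrayList<>());
            pair.add(new ArrayList<>());
            return pair;
        });
    }

    public static void main(String[] args) {
        // Hypothetical rows: A = (pid, pname, drug), B = (drug, maker)
        List<String[]> a = List.of(
            new String[] {"1", "Brandon", "avil"},
            new String[] {"2", "Asha", "avil"},
            new String[] {"3", "Ravi", "crocin"});
        List<String[]> b = List.of(
            new String[] {"avil", "MakerX"},
            new String[] {"metacin", "MakerY"});
        for (Map.Entry<String, List<List<String[]>>> e : cogroup(a, 2, b, 0).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().get(0).size()
                + " row(s) from A, " + e.getValue().get(1).size() + " row(s) from B");
        }
    }
}
```

A key that appears in only one input still gets an entry, with an empty bag for the other input. GROUP is the single-input case of the same idea.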

Pig Scripts
• Use Pig scripts to place Pig Latin statements and Pig commands in a single file.
• It is good practice to name script files with the .pig extension.
• Scripts stored in HDFS can be run directly:
• pig hdfs://path/script.pig
• Single-line (--) and multi-line (/* */) comments can be added.

Pig Server
• It is not a daemon server.
• It is a single-threaded stub for running Pig inside a Java application.
• The class is org.apache.pig.PigServer.
• It allows Java programs to invoke Pig commands.
• Pass "local" or "mapreduce" to indicate the execution mode.
• PigServer ps = new PigServer("local");
• ps.registerQuery("A = load 'file';");
• ps.registerQuery("B = group A by $0;");
• ps.store("B", "outfile");

Implementation of UPPER UDF

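package com;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that upper-cases its first argument.
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Nothing to convert for a missing or empty tuple.
        if (input == null || input.size() == 0) {
            return null;
        }
        // Errors from get(0) propagate to Pig instead of being silently swallowed.
        String str = (String) input.get(0);
        // A null field stays null rather than raising a NullPointerException.
        return str == null ? null : str.toUpperCase();
    }
}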

THANK YOU
