Pig Setup and Test Run: by Kannan Kalidasan

This document provides instructions for setting up and running a test of Pig Latin on Hadoop. It explains that Pig Latin is a data flow language that allows processing of data on Hadoop without Java code. It describes installing Pig, setting environment variables, and running a sample Pig Latin script to load data from a file, group the data by year, count the records, and store the results. The script demonstrates basic Pig Latin keywords like LOAD, GROUP, FOREACH, and STORE.

Pig Introduction
Pig provides a data flow language (Pig Latin) for writing Hadoop operations without writing MapReduce code in Java.
Pig is a layer of abstraction on top of Hadoop that simplifies its use by offering a SQL-like interface for processing data on Hadoop.
It helps to increase productivity, because a short script replaces many lines of Java code.
It supports a variety of data types and also supports user-defined functions (UDFs) for writing custom operations in Java, Python and JavaScript.
To learn Pig, I recommend the book Programming Pig by Alan Gates.
The author explains the concepts in a clear and simple way.

Pig Prompt is Grunt

Running the pig command opens the interactive Grunt shell:
$ pig
grunt>

A Pig session has two modes

Local Mode: Uses a single machine. Pig and all files are installed and run on your local host and local file system. This mode helps to debug a Pig script before running it on a cluster. The -x flag specifies the mode.
pig -x local
MapReduce Mode: Requires access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode.
To add the Hadoop configuration directory to the Pig classpath:
export PIG_CLASSPATH=$HADOOP_HOME/conf/
The two commands below are equivalent; both start a Pig session in MapReduce mode.
pig
pig -x mapreduce

Notes to Remember

The Hadoop services must be running before you start Pig in MapReduce mode, so that Pig can connect to HDFS.

Pig translates Pig Latin scripts into MapReduce jobs internally and runs them on the Hadoop cluster.

In MapReduce mode, Pig reads input files from HDFS and stores the results back to HDFS.
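
As a rough illustration of choosing between the two modes, the sketch below picks MapReduce mode only when a hadoop command is on the PATH and falls back to local mode otherwise. The function name choose_pig_mode is hypothetical, and finding the hadoop command does not prove that the Hadoop services are actually running, so treat this as a convenience check only.

```shell
# Hypothetical helper: print "mapreduce" if a hadoop client is on the PATH,
# otherwise print "local". This does not check that Hadoop services are up.
choose_pig_mode() {
  if command -v hadoop >/dev/null 2>&1; then
    echo mapreduce
  else
    echo local
  fi
}

mode=$(choose_pig_mode)
# Only print the command here; start the session yourself with: pig -x "$mode"
echo "suggested command: pig -x $mode"
```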

Pig Installation
1. Download the stable version of the tarball.
http://mirror.nexcess.net/apache/pig/pig-0.12.0/
pig-0.12.0.tar.gz
Release notes link:
http://pig.apache.org/releases.html#Download
The remaining steps use the pig-0.11.1 package shown in the listings; substitute the version you downloaded.

2. Copy the downloaded package to /usr/local.
kannan@kannandreams:/usr/local$ ls -ltr
total 119460
-rwxr-xr-x 1 root root 63851630 Nov 11 02:11 hadoop-1.2.1.tar.gz
drwxr-xr-x 16 hduser hadoop 4096 Nov 11 23:47 hadoop
-rwxrwxrwx 1 root root 58433159 Dec 3 00:55 pig-0.11.1.tar.gz
kannan@kannandreams:/usr/local$

3. Extract the archive and change the owner.
sudo tar xzf pig-0.11.1.tar.gz
sudo mv pig-0.11.1 pig
sudo chown -R hduser:hadoop pig
The chown command changes the owner of the pig directory from root to the Hadoop user hduser (group hadoop).

4. Log in as the Hadoop user hduser and set the environment variables.

kannan@kannandreams:/usr/local$ su hduser
Add the two lines below to the ~/.bashrc file.
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
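
Editing ~/.bashrc by hand is enough. As an alternative, a small sketch like the following appends each line only if it is missing, so running it twice does not duplicate entries. The names profile_file and add_line are hypothetical, and the sketch writes to a temporary file unless PROFILE_FILE is set (for example to ~/.bashrc).

```shell
# Hypothetical target file; set PROFILE_FILE=~/.bashrc to use it for real.
profile_file="${PROFILE_FILE:-$(mktemp)}"

# Append a line only if an identical whole line is not already in the file.
add_line() {
  grep -qxF "$1" "$profile_file" 2>/dev/null || echo "$1" >> "$profile_file"
}

# Single quotes keep $PATH and $PIG_HOME unexpanded in the profile file.
add_line 'export PIG_HOME=/usr/local/pig'
add_line 'export PATH=$PATH:$PIG_HOME/bin'
add_line 'export PIG_HOME=/usr/local/pig'   # repeated call changes nothing
```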

5. Source the profile file so the changes take effect.
hduser@kannandreams:~$ . .bashrc
hduser@kannandreams:~$
6. Check the pig command.
Only the first lines of the command's output are shown below.
hduser@kannandreams:~$ pig -help
Warning: $HADOOP_HOME is deprecated.
Apache Pig version 0.11.1 (r1459641)
compiled Mar 22 2013, 02:13:53

Test Run

7. Create a sample file for processing (file name: pigcsv).
The file extension does not matter. Pig reads the file according to the load function named in the LOAD statement (PigStorage in this example), not according to the extension.
Create the file in an HDFS directory with the contents below.
2006;
2007;
2008;
2008;
2008;
2008;
2007;
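
One way to create this file is to write it locally and then copy it into HDFS. The sketch below does that under a few assumptions: the scratch directory and the variable names sample_dir and sample_file are hypothetical, the HDFS path matches the one used by the script in step 8, and the copy is attempted only when a hadoop command is available. Depending on the Hadoop version, the mkdir call may need adjusting if the parent directory does not exist or the target directory already does.

```shell
# Write the seven sample records into a scratch directory (hypothetical names).
sample_dir=$(mktemp -d)
sample_file="$sample_dir/pigcsv"
printf '%s\n' '2006;' '2007;' '2008;' '2008;' '2008;' '2008;' '2007;' > "$sample_file"

# Copy the file into HDFS only when the hadoop client is on the PATH.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mkdir /user/hduser/piginput
  hadoop fs -put "$sample_file" /user/hduser/piginput/
else
  echo "hadoop not found; sample file left at $sample_file"
fi
```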

8. Pig Scripts
Method 1 to run the Pig script: Save the script as <<filename>>.pig (in my case, pig_test.pig) and run it as $ pig -x mapreduce pig_test.pig OR $ pig pig_test.pig
SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
USING PigStorage(';') AS (Year:chararray);
GroupByYear = GROUP SampleRecord BY Year;
CountByYear = FOREACH GroupByYear
GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
STORE CountByYear
INTO '/user/hduser/pigoutput' USING PigStorage('\t');

Method 2 to run the Pig script: Type the statements at the Grunt prompt. A statement ends at the semicolon (;), so it can span several lines; continuation lines show the >> prompt.
grunt> SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
>> USING PigStorage(';') AS (Year:chararray);
grunt> GroupByYear = GROUP SampleRecord BY Year;
grunt> CountByYear = FOREACH GroupByYear
>> GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
grunt> STORE CountByYear
>> INTO '/user/hduser/pigoutput' USING PigStorage('\t');
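
Since local mode is meant for debugging a script before it goes to the cluster, the same statements can first be tried against local files. The sketch below is one way to set that up, not part of the original walkthrough: it writes the sample data and a copy of the script that uses relative local paths into a scratch directory, and runs pig -x local only if the pig command is installed. The names workdir and pig_test_local.pig are hypothetical. The quoted heredoc delimiter ('EOF') keeps the shell from expanding $0 and $1 inside the script.

```shell
# Scratch directory holding the input, the script and the output (hypothetical).
workdir=$(mktemp -d)
cd "$workdir"

# Same seven records as the sample file in step 7.
printf '%s\n' '2006;' '2007;' '2008;' '2008;' '2008;' '2008;' '2007;' > pigcsv

# Local-path variant of the script; 'EOF' is quoted so $0 and $1 stay literal.
cat > pig_test_local.pig <<'EOF'
SampleRecord = LOAD 'pigcsv' USING PigStorage(';') AS (Year:chararray);
GroupByYear = GROUP SampleRecord BY Year;
CountByYear = FOREACH GroupByYear
    GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
STORE CountByYear INTO 'pigoutput' USING PigStorage('\t');
EOF

# Run in local mode only when pig is installed; otherwise just report the path.
if command -v pig >/dev/null 2>&1; then
  pig -x local pig_test_local.pig && cat pigoutput/part-*
else
  echo "pig not found; script written to $workdir/pig_test_local.pig"
fi
```

Pig refuses to store into an output directory that already exists, which is why the sketch uses a fresh scratch directory on every run.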

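9. Output:
hduser@kannandreams:/usr/local/hadoop/bin$ hadoop fs -cat /user/hduser/pigoutput/part-r-00000
Warning: $HADOOP_HOME is deprecated.
2006:1
2007:2
2008:4
Year:1
hduser@kannandreams:/usr/local/hadoop/bin$
The Year:1 line suggests that the input file used for this run also contained a header line (Year), which Pig counts like any other record.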
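
The counts can be cross-checked without Pig using standard shell tools. The sketch below feeds the seven records listed in step 7 through cut, sort, uniq and awk and prints them in the same Key:Value form; the function name count_by_year is hypothetical, and this only mirrors the script's logic rather than replacing it.

```shell
# Hypothetical helper: read "year;" lines on stdin, print "year:count" lines.
count_by_year() {
  cut -d';' -f1 |            # keep the first ;-delimited field (the year)
    sort |                   # bring equal years together
    uniq -c |                # count each run of equal years
    awk '{print $2 ":" $1}'  # reorder "count year" into "year:count"
}

printf '%s\n' '2006;' '2007;' '2008;' '2008;' '2008;' '2008;' '2007;' | count_by_year
# prints:
# 2006:1
# 2007:2
# 2008:4
```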

Script Explanation
Load the file into a relation, specifying the delimiter (;) and, in the AS clause, the name and type of each column.
Separate the column definitions with commas when the file has more than one column. By default, Pig loads tab-delimited files, so any other delimiter character must be given explicitly.
SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
USING PigStorage(';') AS (Year:chararray);
Group the loaded records by year.
GroupByYear = GROUP SampleRecord BY Year;
Count the records in each group and generate the output as Key:Value. How you format the output is up to you. $0 is the group key (the year) and $1 is the bag of records in that group, so COUNT($1) gives the number of records for that year.
CountByYear = FOREACH GroupByYear
GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
Store the relation in an output directory.
STORE CountByYear
INTO '/user/hduser/pigoutput' USING PigStorage('\t');
For the complete set of script commands, refer to
http://pig.apache.org/docs/r0.10.0/start.html#data-results

Pig in Cloudera
The Pig Editor in Cloudera is explained on my blog.
http://kannandreams.wordpress.com/2013/12/03/pig-editor-in-cloudera/#!

Thank You !!!


mail : [email protected]
@kannanpoem on twitter
Blog: http://kannandreams.wordpress.com/about/
FB Community: www.facebook.com/groups/huge360/
HUGE - Hadoop User Group & Enthusiasts
HUGE: yes, it's all about "BIG" data.
This group was created to bring together expertise and experts in Hadoop and Big Data.
