Pig Setup and Test Run: by Kannan Kalidasan
Pig Introduction
Pig provides a data flow language (Pig Latin) for writing Hadoop operations without writing MapReduce code in Java.
Pig is a layer of abstraction on top of Hadoop to simplify its use by giving a SQL-like interface to
process data on Hadoop.
It helps increase productivity because you do not have to write many lines of Java code.
It supports a variety of data types and also supports user-defined functions (UDFs) for writing custom
operations in Java, Python and JavaScript.
To learn Pig, I recommend the book Programming Pig by Alan Gates.
The author explains the concepts in a clear and simple way.
Hadoop services must be running before you start Pig in MapReduce mode, so that it can connect to HDFS
and we can proceed with our work.
Pig translates Pig Latin scripts into MapReduce jobs internally and runs them on the Hadoop cluster.
In MapReduce mode, Pig reads input files from HDFS only and stores the results back to HDFS.
Pig Installation
1. Download the stable version of the tarball.
http://mirror.nexcess.net/apache/pig/pig-0.12.0/
pig-0.12.0.tar.gz
Release notes link
http://pig.apache.org/releases.html#Download
Script Explanation
Load the file into a relation, specifying the delimiter (;) and each column's name and type.
Separate the column definitions with commas when the file contains more than one column. By default, Pig loads files
delimited by tabs, so any other delimiter character must be specified explicitly.
SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
USING PigStorage(';') AS (Year:chararray);
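To make the role of the delimiter concrete, here is a minimal plain-Python sketch of the idea behind the LOAD statement: split each line on the delimiter and name the fields by position. It is only an illustration, not Pig code and not how Pig is implemented. The function load_delimited, the sample lines and the second field Amount are hypothetical and are not part of the original script.

```python
# Plain-Python illustration only; this is not Pig and not Pig's implementation.

def load_delimited(lines, delimiter, field_names):
    """Split each line on the delimiter and name the fields by position."""
    records = []
    for line in lines:
        values = line.rstrip("\n").split(delimiter)
        # zip pairs each field name with the value in the same position
        records.append(dict(zip(field_names, values)))
    return records

# Hypothetical sample lines standing in for the contents of the input file.
sample_lines = ["2012;100", "2013;250", "2012;75"]

# Splitting on ';' separates the two columns as intended.
sample_record = load_delimited(sample_lines, ";", ["Year", "Amount"])

# Splitting on tab (the default delimiter) finds no tab in these lines,
# so the whole line ends up in the first field.
wrong_delimiter = load_delimited(sample_lines, "\t", ["Year", "Amount"])

print(sample_record[0])
print(wrong_delimiter[0])
```

The second call shows why the delimiter has to be given explicitly for a semicolon-separated file: with the tab default, the line is never split.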
Group the loaded records by year.
GroupByYear = GROUP SampleRecord BY Year;
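In Pig, the result of GROUP has two fields: the key, named group, and a bag holding all records that share that key. The plain-Python sketch below mimics that shape on a few hypothetical in-memory records. The helper group_by and the sample data are assumptions made for illustration, and the sort is there only to make the printed order stable.

```python
# Plain-Python illustration of the shape GROUP produces; not Pig code.
from collections import defaultdict

def group_by(records, key):
    """Collect records that share the same key value into one list ("bag")."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    # Sort by key only so that the output order is predictable here.
    return sorted(groups.items())

# Hypothetical records, standing in for the loaded SampleRecord data.
sample_record = [{"Year": "2012"}, {"Year": "2013"}, {"Year": "2012"}]

group_by_year = group_by(sample_record, "Year")

for year, bag in group_by_year:
    # Each entry pairs the group key with the bag of matching records.
    print(year, len(bag))
```

Each entry pairs one year with every record for that year, which is the input a later aggregation step such as a count per year would work on.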
Pig in Cloudera
The Pig Editor in Cloudera is explained on my blog.
http://kannandreams.wordpress.com/2013/12/03/pig-editor-in-cloudera/#!