BDA Unit-4-PPT
Pig vs SQL
Pig is a procedural language; SQL is a declarative language.
In Pig, schema is optional; in SQL, schema is mandatory.
Pig offers limited opportunity for query optimization; SQL offers more opportunity for query optimization.
Pig allows splits in the pipeline (illustrated below).
Pig allows developers to store data anywhere in the pipeline.
Pig provides operators to perform ETL (Extract, Transform, and Load) functions.
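For example, a minimal Pig Latin sketch of a pipeline split (the users.txt file and its fields are assumptions, not from the slides):
-- load a hypothetical users file and split the pipeline into two relations
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
SPLIT users INTO minors IF age < 18, adults IF age >= 18;
DUMP minors;
DUMP adults;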
Apache Pig Vs Hive
Language: Pig uses Pig Latin; Hive uses HiveQL.
Created at: Pig was created at Yahoo; Hive at Facebook.
Pig is a data flow language; Hive is a query processing language.
Pig is a procedural language that fits the pipeline paradigm; Hive is a declarative language.
Pig handles structured, unstructured, and semi-structured data; Hive is mostly for structured data.
Applications of Apache Pig
2007 – Open sourced via the Apache Incubator.
2010 – Graduated as an Apache top-level project.
Architecture
Parser : Checks the syntax of the Pig Latin script and produces a DAG (logical plan) of the statements.
Local mode : $ ./pig -x local
MapReduce mode : $ ./pig -x mapreduce
Either of these commands gives you the Grunt shell
prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
Batch Mode : Write the Pig Latin statements in a single script file and run the whole file at once (see the sketch below).
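A minimal sketch of batch-mode execution (sample_script.pig is a hypothetical script name):
$ ./pig -x local sample_script.pig
$ ./pig -x mapreduce sample_script.pig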
sh Command
Invoke any shell commands
grunt> sh shell_command parameters
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Invoke any Hadoop file system shell command
grunt> fs subcommand parameters
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data
Utility Commands
clear : clear the screen
grunt> clear
help : Provides help about the commands.
history : Displays a list of statements executed/used so
far since the Grunt shell was invoked.
set : Used to show/assign values to keys used in Pig.
quit : You can quit from the Grunt shell.
exec/run: Can execute Pig scripts
grunt> exec [-param param_name = param_value] [-param_file file_name] script
kill : Kill a job from the Grunt shell: grunt> kill JobId
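For example (the script name and job id below are hypothetical):
grunt> exec sample_script.pig
grunt> kill job_201504171230_0005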
Pig Latin
A relation is the outermost structure of the Pig Latin data model, and it is a bag,
where –
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data (see the example below).
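A small illustration of this hierarchy (the values are made up):
field : 'Rajiv'
tuple : (1, 'Rajiv', 'Reddy', '9848022337', 'Hyderabad')
bag : { (1, 'Rajiv', 'Reddy'), (2, 'Siddarth', 'Battacharya') }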
Processing Data:
Statements are the basic constructs
Statements work with relations
Statements include operators, expressions and schemas
Statements take a relation as input and produce another relation as
output, except LOAD and STORE
Student_data = LOAD
'student_data.txt'
USING PigStorage(',') AS ( id:int,
firstname:chararray,
lastname:chararray,
phone:chararray,
city:chararray );
Values of all the data types can be NULL, and Pig
treats null values in a similar way as SQL does
Operators
The Load operator syntax is: Relation_name = LOAD 'Input file path' USING function AS schema;
Relation_name : The relation in which we want to store the data.
Input file path : The HDFS directory where the file is stored.
function : A function from the set of load functions provided by
Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
schema : Defines the schema of the data.
We can define the required schema as follows:
(column1 : data type, column2 : data type, column3 : data type);
Note: We can also load the data without specifying the schema.
In that case, the columns are addressed positionally as $0, $1, $2,
etc. (as shown below).
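A minimal sketch of positional references, reusing the student data file from the earlier LOAD example (the HDFS path is an assumption):
student_nosch = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
-- $0 is id, so $1 and $2 are firstname and lastname
names = FOREACH student_nosch GENERATE $1, $2;
DUMP names;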
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt' USING
PigStorage(',') AS ( id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray );
The PigStorage() function:
It loads and stores data as structured text files. It takes as a
parameter the delimiter by which each entity of a tuple is
separated. By default, it uses '\t' (tab) as the delimiter.
Store operator : Used to store (save) the result relation to the file system.
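A minimal sketch of the Store operator, reusing the student relation loaded above (the output directory is an assumption):
grunt> STORE student INTO 'hdfs://localhost:9000/pig_output/' USING PigStorage(',');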
COUNT_STAR
• Used to get the number of elements in a bag; it includes the NULL values.
SUM : Used to get the total of the numeric values of a column in a single-column
bag; it ignores the null values.
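A sketch of how these are typically used (the marks.txt file and its fields are hypothetical):
-- hypothetical marks file with (id, marks) per line
marks_data = LOAD 'marks.txt' USING PigStorage(',') AS (id:int, marks:int);
grouped_all = GROUP marks_data ALL;
-- count of marks values (including nulls) and their sum (ignoring nulls)
totals = FOREACH grouped_all GENERATE COUNT_STAR(marks_data.marks), SUM(marks_data.marks);
DUMP totals;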
DIFF:
Used to compare two bags (fields) in a tuple.
It takes two fields of a tuple as input and compares them.
If they match, it returns an empty bag.
If they do not match, it finds the elements that exist in one field (bag) but not
in the other, and returns these elements wrapped in a bag.
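A minimal sketch of DIFF after a COGROUP (the emp_sales and emp_bonus relations and their files are hypothetical):
emp_sales = LOAD 'emp_sales.txt' USING PigStorage(',') AS (sno:int, name:chararray);
emp_bonus = LOAD 'emp_bonus.txt' USING PigStorage(',') AS (sno:int, name:chararray);
-- group both relations by sno; each row now holds two bags
cogrouped = COGROUP emp_sales BY sno, emp_bonus BY sno;
diff_result = FOREACH cogrouped GENERATE DIFF(emp_sales, emp_bonus);
DUMP diff_result;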
SUBTRACT :
Used to subtract two bags.
It takes two bags as inputs and returns a bag which contains the tuples of the first
bag that are not in the second bag.
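A sketch reusing the cogrouped relation from the DIFF example above (SUBTRACT is a built-in from Pig 0.13 onwards):
-- tuples of emp_sales that do not appear in emp_bonus
sub_result = FOREACH cogrouped GENERATE SUBTRACT(emp_sales, emp_bonus);
DUMP sub_result;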
IsEmpty : Used to check if a bag or map is empty.
Size : Used to compute the number of elements based on any Pig data
type.
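A sketch reusing relations from the examples above:
-- keep only the cogrouped rows whose emp_sales bag is not empty
non_empty = FILTER cogrouped BY NOT IsEmpty(emp_sales);
-- SIZE of a chararray gives the number of characters
name_sizes = FOREACH student GENERATE firstname, SIZE(firstname);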
BagToString :
Used to concatenate the elements of a bag into a string.
While concatenating, we can place a delimiter between these values (optional).
Concat : Used to concatenate two or more expressions of the same type.
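A sketch of both functions, reusing the student relation loaded earlier (the grouping is illustrative):
-- join first and last names into one string
full_names = FOREACH student GENERATE CONCAT(firstname, lastname);
-- group the students by city, then join each group's first names with '_'
by_city = GROUP student BY city;
names_per_city = FOREACH by_city GENERATE group, BagToString(student.firstname, '_');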
Tokenize :
Used to split a string (which contains a group of words) in a single tuple and
return a bag which contains the output of the split operation.
As a delimiter to the tokenize function, we can pass space [ ], double quote [" "],
comma [ , ], parenthesis [ () ], star [ * ].
Word Count Example:
lines = LOAD 'data' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
Load and Store functions
Used to determine how the data goes into and comes out of Pig.
Not every function works for both; TextLoader, for example, only loads data and
can't be used for the store operation.
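A minimal sketch of loading unstructured text with TextLoader and storing it with PigStorage (the paths are hypothetical):
-- TextLoader reads each line of an unstructured file as a single chararray
raw_lines = LOAD 'hdfs://localhost:9000/pig_data/log.txt' USING TextLoader() AS (line:chararray);
STORE raw_lines INTO 'hdfs://localhost:9000/pig_output/raw_lines' USING PigStorage();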