BDA Unit-4-PPT

Apache Pig is an abstraction over MapReduce that simplifies data analysis in Hadoop using a high-level language called Pig Latin. It allows programmers to perform complex data manipulation tasks easily, supports user-defined functions, and handles both structured and unstructured data. Key features include a rich set of operators, optimization opportunities, and the ability to process large datasets efficiently.


Apache Pig

What is Apache Pig?

 An abstraction over MapReduce
 Used to analyze large data sets, representing them as data flows
 Performs all the data manipulation operations in Hadoop
 Provides a high-level language known as Pig Latin
 Programmers can develop their own functions for reading, writing, and processing data
 Scripts are internally converted into Map and Reduce tasks by the Pig Engine
Why Do We Need Apache Pig?

 Programmers can perform MapReduce tasks easily without having to type complex code in Java
 Uses a multi-query approach, thereby reducing the length of code
 SQL-like language
 Provides many built-in operators and data types
Features of Pig
Rich set of operators
It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming
Pig Latin is similar to SQL and it is easy to write a Pig script if
you are good at SQL.
Optimization opportunities

Apache Pig optimizes execution automatically, so programmers need to focus only on the semantics of the language.
Extensibility
Using the existing operators, users can develop their own
functions to read, process, and write data.
User-defined Functions
Invoke or embed them in Pig Scripts
Handles all kinds of data

Structured as well as unstructured.


Apache Pig Vs MapReduce

Apache Pig:
 Data flow language
 High-level language
 Performing a Join operation is simple
 Knowledge of SQL is sufficient
 No need for compilation; every Apache Pig operator is converted internally into a MapReduce job

MapReduce:
 Data processing paradigm
 Low level and rigid
 Difficult to perform a Join operation between datasets
 Exposure to Java is mandatory
 Has a long compilation process
Apache Pig Vs SQL

Pig:
 Procedural language
 Schema is optional
 Limited opportunity for query optimization
 Allows splits in the pipeline
 Allows developers to store data anywhere in the pipeline
 Provides operators to perform ETL (Extract, Transform, and Load) functions

SQL:
 Declarative language
 Schema is mandatory
 More opportunity for query optimization
Apache Pig Vs Hive

Pig:
 Language: Pig Latin
 Created at Yahoo
 Data flow language
 A procedural language; fits in the pipeline paradigm
 Handles structured, unstructured, and semi-structured data

Hive:
 Language: HiveQL
 Created at Facebook
 Query processing language
 Declarative language
 Mostly for structured data
Applications of Apache Pig

 Tasks involving ad-hoc processing and quick prototyping

 To process huge data sources such as web logs

 To perform data processing for search platforms

 To process time sensitive data loads


History
2006 – Developed as a research project at Yahoo
2007 – Open sourced via the Apache Incubator
2008 – The first release of Apache Pig came
2010 – Graduated as an Apache top-level project
Architecture
Parser

 Checks the syntax of the script - type checking


 Output of the parser is a DAG (directed acyclic graph) representing the Pig Latin statements and logical operators
Optimizer

 The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler and Execution engine

 The compiler compiles the optimized logical plan into a series of MapReduce jobs.
 Finally, the MapReduce jobs are submitted to Hadoop in sorted order for execution to produce the desired results.
Data Model
 Atom  Tuple
 Any single value - irrespective of  A record that is formed by an
their data type - Atom. ordered set of fields is known as a
tuple, the fields can be of any
 It is stored as string and can be
type.
used as string and number.
 A tuple is similar to a row in a
 int, long, float, double, chararray,
table of RDBMS.
and bytearray are the atomic
values of Pig.  Example: (Raja, 30)
 A piece of data or a simple
atomic value is known as a field.
 Example: ‘raja’ or ‘30’
 Bag  Relation
 Unordered set of tuples or a  A bag of tuples.
collection of tuples
 Unordered - No guarantee that
 Tuple can have any number of tuples are processed in any
fields (flexible schema). particular order.
 Represented by ‘{}’.  Map
 Similar to a table in RDBMS  A map (or data map) is a set of
 Not necessary - tuples contain the key-value pairs.
same number of fields and have  The key needs to be of type
the same type. chararray and should be unique.
 Example:  The value might be of any type. It
{(Raja, 30), (Mohammad, 45)} is represented by ‘[]’

 Can be a field in a relation - inner  Example: [name#Raja, age#30]


bag.
 Example: {Raja, 30, {9848022338,
[email protected],}}
Execution Modes

 Local Mode
 Runs from your local host and local file system
 Used for testing purposes
 MapReduce Mode
 Loads or processes data that exists in the Hadoop Distributed File System (HDFS)
 A MapReduce job is invoked in the back-end to perform a particular operation on the data
Execution Mechanisms

Interactive Mode (Grunt shell)


Batch Mode (Script)
Embedded Mode (UDF)
Defining our own functions (User Defined Functions) in
programming languages such as Java, and using
them in our script.
Invoking the Grunt Shell

$ ./pig -x local
$ ./pig -x mapreduce
Either of these commands gives you the Grunt shell
prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
Batch Mode

Write an entire Pig Latin script in a file and execute it using the -x option.

$ pig -x local Sample_script.pig

$ pig -x mapreduce Sample_script.pig


Shell & Utility commands

sh Command
Invoke any shell commands
grunt> sh shell_command parameters
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Invoke any Hadoop File system Shell commands
grunt> fs File System command parameters
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data
Utility Commands
 clear : clear the screen
 grunt> clear
 help : Provides help about the commands.
 history : Displays a list of statements executed/used so far since the Grunt shell was invoked.
 set : Used to show/assign values to keys used in Pig.
 quit : You can quit from the Grunt shell.
 exec/run : Execute Pig scripts
 grunt> exec [-param param_name = param_value] [-param_file file_name] script
 kill : Kill a job from the Grunt shell: grunt> kill JobId
Pig Latin
 A Relation is the outermost structure of the Pig Latin data model. It is a bag where –
 A bag is a collection of tuples.
 A tuple is an ordered set of fields.
 A field is a piece of data.
 Processing Data:
 Statements are the basic constructs
 Statements work with relations
 Statements include operators, expressions and schemas
 Statements take a relation as input and produce another relation as
output except load and store
Student_data = LOAD 'student_data.txt'
    USING PigStorage(',') AS (id:int, firstname:chararray,
    lastname:chararray, phone:chararray, city:chararray);
 Values for all the data types can be NULL, and Pig treats null values in a similar way as SQL does
Operators

Arithmetic: +, -, *, /, %
Bincond operator (?:): b = (a == 1) ? 20 : 30;
CASE WHEN THEN ELSE END:
CASE f2 % 2
    WHEN 0 THEN 'even'
    WHEN 1 THEN 'odd'
END

Comparison: ==, !=, >, <, >=, <=
matches (pattern matching): f1 matches '.*tutorial.*'

Type construction:
Tuple construction operator () : (Raju, 30)
Bag construction operator {} : {(Raju, 30), (Mohammad, 45)}
Map construction operator [] : [name#Raja, age#30]
Relational operators
Preparing Data
 In MapReduce mode, Pig reads (loads) data from HDFS and stores the
results back in HDFS. Therefore, let us start HDFS and create the following
sample data in HDFS.
Load Operator
 The load statement consists of two parts divided by the “=” operator.
 On the left-hand side, we need to mention the name of the relation where
we want to store the data, and on the right-hand side, we have to define
how we store the data.
 Given below is the syntax of the Load operator.
 Relation_name = LOAD 'Input file path' USING function as schema;

Component Description
Relation_name The relation in which we want to store the data.
Input file path Mention the HDFS directory where the file is stored
function A function from the set of load functions provided by
Apache Pig (BinStorage, JsonLoader, PigStorage,
TextLoader).
schema Define the schema of the data
 We can define the required schema as follows
 (column1 : data type, column2 : data type, column3 : data type);
 Note: We can also load the data without specifying the schema. In that case, the columns will be addressed as $0, $1, etc.
 grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
 The PigStorage() function:
 It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, the delimiter is ‘\t’.
Store operator

 STORE Relation_name INTO 'required_directory_path' [USING function];
 Ex:
 grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Diagnostic Operators
 Dump Operator
The Dump operator is used to run the Pig Latin statements and
display the results on the screen. It is generally used for
debugging purposes.
 grunt> Dump Relation_Name;
Ex: Dump student;
 Once you execute the above Pig Latin statement, it will start a
MapReduce job to read data from HDFS.
 Describe: Used to view the schema of a relation
 grunt> Describe Relation_name
 Ex: grunt> describe student;
Output: student: { id: int,firstname: chararray,lastname: chararray,phone:
chararray, city: chararray }
 Explain: Used to display the logical, physical, and MapReduce
execution plans of a relation.
 grunt> explain Relation_name;
 Ex: grunt> explain student;
 Illustrate: Gives you the step-by-step execution of a sequence of
statements
 grunt> illustrate Relation_name;
 grunt> illustrate student;
Group Operator
 The group operator is used to group the data in one or more relations. It
collects the data having the same key.
 Group_data = GROUP Relation_name BY age;
 grunt> group_data = GROUP student_details by age;
 grunt> Dump group_data;
 The output contains two columns: one is age – with which we have grouped the
relation, and the other is Bag – Which contains group of tuples, student records
with the respective age.
 You can see the schema of the table after grouping the data using the describe command.
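As a sketch, grouping by age looks like this (the file path and schema here are assumed for illustration):

```pig
-- Assumed sample relation; file name and field list are illustrative
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int, city:chararray);

-- Collect tuples that share the same age value into one group each
group_data = GROUP student_details BY age;

-- The resulting schema has two parts: the group key and a bag of matching tuples
-- grunt> describe group_data;
-- group_data: {group: int, student_details: {(id: int, firstname: chararray,
--              lastname: chararray, age: int, city: chararray)}}
```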
Cogroup Operator
 The group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.
The cogroup operator groups the tuples from each relation according to age, where each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21, and it contains two bags –
the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
If a relation doesn’t have any tuples with the age value 21, it returns an empty bag.
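A minimal sketch, assuming student_details and employee_details relations that both carry an age field:

```pig
-- Cogroup two relations on the age field (relation names are assumptions)
cogroup_data = COGROUP student_details BY age, employee_details BY age;

-- Each output tuple holds an age value plus one bag per input relation
DUMP cogroup_data;
```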
Join Operator
 The join operator is used to combine records from two or more relations.
 While performing a join operation, we declare one (or a group of) field(s) from each relation as keys.
 When these keys match, the two particular tuples are matched; otherwise the records are dropped.
 Joins can be of the following types:
 Self-join
 Inner-join
 Outer-join : left join, right join, and full join
 Self-join is used to join a table with itself as if the table
were two relations, temporarily renaming at least one
relation.
 Generally, in Apache Pig, to perform self-join, we will
load the same data multiple times, under different
aliases (names).
Outer Join
 Returns all the rows from at least one of the relations. An outer join
operation is carried out in three ways – Left, Right, and Full.
 left outer Join operation returns all rows from the left table, even if there are
no matches in the right relation.
 right outer join operation returns all rows from the right table, even if there
are no matches in the left table.
 full outer join operation returns rows when there is a match in one of the
relations.
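The join variants above can be sketched as follows, assuming customers and orders relations keyed on a customer id (all names here are illustrative):

```pig
customers = LOAD 'customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, city:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',')
    AS (oid:int, order_date:chararray, customer_id:int, amount:int);

-- Inner join: keep only customers that have matching orders
inner_data = JOIN customers BY id, orders BY customer_id;

-- Left outer join: keep every customer, with or without orders
left_data  = JOIN customers BY id LEFT OUTER, orders BY customer_id;

-- Right outer join: keep every order, with or without a matching customer
right_data = JOIN customers BY id RIGHT OUTER, orders BY customer_id;

-- Full outer join: keep rows that match in either relation
full_data  = JOIN customers BY id FULL OUTER, orders BY customer_id;
```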
Cross Operator
 Computes the cross-product of two or more relations.
Combining and Splitting
 Union Operator :
 The UNION operator of Pig Latin is used to merge the content of two relations.
 To perform UNION operation on two relations, their columns and domains must
be identical.
 Split : Used to split a relation into two or more relations.
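A sketch of both operators (relation and field names are assumed for illustration):

```pig
-- SPLIT routes each tuple into one or more output relations by condition
SPLIT student_details INTO
    minor_students IF age < 18,
    adult_students IF age >= 18;

-- UNION merges two relations whose columns and domains are identical
all_students = UNION minor_students, adult_students;
```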
Filter Operator
 Used to select the required tuples from a relation based on a condition.
Distinct Operator
 Used to remove redundant (duplicate) tuples from a relation
Foreach Operator
 Used to generate specified data transformations based on the column
data.
Order By
 Used to display the contents of a relation in a sorted order based on one or
more fields.
Limit Operator
 Used to get a limited number of tuples from a relation
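One sketch covering these five operators, assuming a student_details relation with id, firstname, age, and city fields:

```pig
-- FILTER: select tuples matching a condition
chennai_students = FILTER student_details BY city == 'Chennai';

-- DISTINCT: remove duplicate tuples
unique_students = DISTINCT student_details;

-- FOREACH: project/transform selected columns
names = FOREACH student_details GENERATE id, firstname, city;

-- ORDER BY: sort on one or more fields
by_age = ORDER student_details BY age DESC;

-- LIMIT: take only the first few tuples
first_four = LIMIT by_age 4;
```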
Built-in Functions – EVAL Functions
 AVG: Used to compute the average of the numerical values within a bag
and ignores the NULL values.
 To get the global average value, we need to perform a Group All operation,
and calculate the average value using the AVG function.
 To get the average value of a group, we need to group it using the Group By
operator and proceed with the average function.
 Max - Used to calculate the highest value for a column (numeric values or
chararrays) in a single-column bag and ignores the NULL values.
 Count:
 Used to get the number of elements in a bag.
 While counting the number of tuples in a bag, the COUNT() function ignores (will not count) tuples having a NULL value in the first field.
 COUNT_STAR
• It includes the NULL values.
 Sum: to get the total of the numeric values of a column in a single-column
bag and ignores the null values.
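Putting these together, a sketch of computing global aggregates with Group All (relation and field names assumed):

```pig
-- Group All puts the whole relation into a single group
student_all = GROUP student_details ALL;

-- Apply the eval functions to the age column of the grouped bag
stats = FOREACH student_all GENERATE
    AVG(student_details.age),
    MAX(student_details.age),
    SUM(student_details.age),
    COUNT(student_details.age),       -- skips tuples with a NULL first field
    COUNT_STAR(student_details.age);  -- includes NULL values
```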
 DIFF:
 Used to compare two bags (fields) in a tuple.
 It takes two fields of a tuple as input and matches them.
 If they match, it returns an empty bag.
 If they do not match, it finds the elements that exist in one field (bag) but not in the other, and returns these elements wrapped in a bag.
 Generally, the DIFF() function compares two bags in a tuple.
 SUBTRACT :
 Used to subtract two bags.
 It takes two bags as inputs and returns a bag which contains the tuples of the first
bag that are not in the second bag.
 IsEmpty : Used to check if a bag or map is empty.
 Size : Used to compute the number of elements based on any Pig data
type.
 BagToString :
 Used to concatenate the elements of a bag into a string.
 While concatenating, we can place a delimiter between these values (optional).
 Concat : Used to concatenate two or more expressions of the same type.
 Tokenize :
 Used to split a string (which contains a group of words) in a single tuple and
return a bag which contains the output of the split operation.
 As a delimiter to the tokenize function, we can pass space [ ], double quote [" "], comma [ , ], parenthesis [ () ], star [ * ].
 Word Count Example:
 lines = LOAD 'data' AS (line:chararray);
 words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
 grouped = GROUP words BY word;
 wordcount = FOREACH grouped GENERATE group, COUNT(words);
 DUMP wordcount;
Load and Store functions
 Used to determine how the data goes into and comes out of Pig.
 TextLoader() is used to load unstructured text data; it can’t be used for the store operation.
 BinStorage() in Pig is generally used to store temporary data generated between MapReduce jobs.
 Handling Compression: Compressed files can be read using the PigStorage and TextLoader functions.
Bag and Tuple Functions
 TOBAG :
 Converts one or more expressions to individual tuples. And these tuples are
placed in a bag.
 TOP : Used to get the top N tuples of a bag.
 To this function, as inputs, we have to pass a relation, the number of tuples we
want, and the column name whose values are being compared.
 This function will return a bag containing the required columns.
 TOTUPLE : Used to convert one or more expressions to the data type tuple.
 TOMAP : Used to convert key-value pairs into a Map.
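A sketch of these conversion functions, using the assumed student_details schema (id, firstname, lastname, age, city):

```pig
-- Convert expressions into a bag, a tuple, and a map respectively
converted = FOREACH student_details GENERATE
    TOBAG(id, firstname),          -- one tuple per expression, placed in a bag
    TOTUPLE(id, firstname, city),  -- (id, firstname, city)
    TOMAP(firstname, age);         -- [firstname#age]

-- TOP(n, column, bag): top-2 tuples per group, compared on column 3
-- (age, in the assumed schema)
grouped = GROUP student_details BY city;
top_two = FOREACH grouped GENERATE TOP(2, 3, student_details);
```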
String Functions
Operator Description
ENDSWITH ENDSWITH(string, testAgainst)
To verify whether a given string ends with a particular
substring
STARTSWITH STARTSWITH(string, substring)
Accepts two string parameters and verifies whether the
first string starts with the second.
SUBSTRING SUBSTRING(string, startIndex, stopIndex)
Returns a substring from a given string.

EqualsIgnoreCase EqualsIgnoreCase(string1, string2)
To compare two strings, ignoring the case.

INDEXOF INDEXOF(string, ‘character’, startIndex)


Returns the first occurrence of a character in a string,
searching forward from a start index.
Operator Description
LAST_INDEX_OF LAST_INDEX_OF(string, ‘character’)
Returns the index of the last occurrence of a character in a
string, searching backward from the end.
LCFIRST /UCFIRST LCFIRST(expression) /UCFIRST(expression)
Converts the first character in a string to lower case /Upper
case.
REPLACE REPLACE(string, ‘oldChar’, ‘newChar’);
To replace existing characters in a string with new characters.

UPPER / LOWER UPPER(expression) / LOWER(expression)


Returns a string converted to upper/lower case.
STRSPLIT STRSPLIT(string, regex, limit)
To split a string around matches of a given regular expression.
SPLITTOBAG SPLITTOBAG(string, regex, limit)
Similar to the STRSPLIT() function, it splits the string by given
delimiter and returns the result in a bag.
TRIM /LTRIM/RTRIM TRIM(expression) /LTRIM(expression)/RTRIM(expression) Returns
a copy of a string with leading and trailing / leading/ trailing
whitespaces removed.
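A few of these string functions in one sketch (field names assumed):

```pig
str_demo = FOREACH student_details GENERATE
    STARTSWITH(firstname, 'Ra'),         -- boolean result
    SUBSTRING(firstname, 0, 2),          -- characters at index 0 and 1
    UPPER(city),                         -- upper-cased copy
    REPLACE(city, 'Chennai', 'Madras'),  -- substitute matching text
    TRIM(lastname);                      -- strip surrounding whitespace
```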
Date and Time functions
 ToDate : Used to generate a DateTime object according to the given
parameters.
 ToDate(milliseconds)
 ToDate(userstring, format)
 ToDate(userstring, format, timezone)
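For instance, assuming a relation whose date column holds strings like 1989/09/26 09:00:00:

```pig
-- Parse a chararray into a DateTime object using a format pattern
todate_data = FOREACH date_data GENERATE
    ToDate(date, 'yyyy/MM/dd HH:mm:ss') AS converted_date;
```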
Math functions
 ABS, ACOS, ATAN, ASIN, CBRT, CEIL, COS, COSH, EXP, FLOOR, LOG, LOG10,
RANDOM, ROUND, SIN, SINH, SQRT, TAN, TANH
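As an illustration (the input relation and its double column are assumed):

```pig
-- Apply a few math functions to a double column named data
math_demo = FOREACH math_data GENERATE
    ABS(data), CEIL(data), FLOOR(data), ROUND(data), SQRT(data);
```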
Running Scripts
 How do we run Apache Pig scripts in batch mode?
 Comments in Pig Script :
 /* */ : multi-line comment
 -- : single-line comment
 Executing Pig Script in Batch mode
Step 1
Write all the required Pig Latin statements in a single file. We can write all the
Pig Latin statements and commands in a single file and save it as .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the Linux shell, e.g. $ pig -x local Sample_script.pig
 You can execute it from the Grunt shell as well using the exec command as
shown below.
grunt> exec /sample_script.pig
 Executing a Pig Script from HDFS :
 Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory
named /pig_data/.
 $ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
