BDA Unit-4-PPT

Apache Pig is an abstraction over MapReduce that simplifies data analysis in Hadoop using a high-level language called Pig Latin. It allows programmers to perform complex data manipulation tasks easily, supports user-defined functions, and handles both structured and unstructured data. Key features include a rich set of operators, optimization opportunities, and the ability to process large datasets efficiently.


Apache Pig

What is Apache Pig?

 An abstraction over MapReduce
 Used to analyze large data sets, representing them as data flows
 Performs all the data manipulation operations in Hadoop
 Provides a high-level language known as Pig Latin
 Programmers can develop their own functions for reading, writing, and processing data
 Scripts are internally converted into Map and Reduce tasks by the Pig Engine
Why Do We Need Apache Pig?

 Programmers can perform MapReduce tasks easily without having to type complex code in Java
 Uses a multi-query approach, thereby reducing the length of code
 SQL-like language
 Provides many built-in operators and data types
Features of Pig
Rich set of operators
It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming
Pig Latin is similar to SQL and it is easy to write a Pig script if
you are good at SQL.
Optimization opportunities

Apache Pig optimizes execution automatically, so programmers need to focus only on the semantics of the language.
Extensibility
Using the existing operators, users can develop their own
functions to read, process, and write data.
User-defined Functions
Invoke or embed them in Pig Scripts
Handles all kinds of data

Structured as well as unstructured.


Apache Pig Vs MapReduce

Apache Pig:
 Data flow language
 High-level language
 Performing a Join operation is simple
 Knowledge of SQL is sufficient
 No need for compilation; every Apache Pig operator is converted internally into a MapReduce job

MapReduce:
 Data processing paradigm
 Low level and rigid
 Difficult to perform a Join operation between datasets
 Exposure to Java is mandatory
 Has a long compilation process
Apache Pig Vs SQL

Pig:
 Procedural language
 Schema is optional
 Limited opportunity for query optimization
 Allows splits in the pipeline
 Allows developers to store data anywhere in the pipeline
 Provides operators to perform ETL (Extract, Transform, and Load) functions

SQL:
 Declarative language
 Schema is mandatory
 More opportunity for query optimization
Apache Pig Vs Hive

Pig:
 Language: Pig Latin
 Created at Yahoo
 Data flow language
 A procedural language; fits in the pipeline paradigm
 Handles structured, unstructured, and semi-structured data

Hive:
 Language: HiveQL
 Created at Facebook
 Query processing language
 Declarative language
 Mostly for structured data
Applications of Apache Pig

 Tasks involving ad-hoc processing and quick prototyping

 To process huge data sources such as web logs

 To perform data processing for search platforms

 To process time sensitive data loads


History
2006 – Developed as a research project at Yahoo
2007 – Open sourced via the Apache Incubator
2008 – The first release of Apache Pig came
2010 – Graduated as an Apache top-level project
Architecture
Parser

 Checks the syntax of the script - type checking


 Output of the parser is a DAG (directed acyclic graph) representing the Pig Latin statements and logical operators
Optimizer

 The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler and Execution engine

 The compiler compiles the optimized logical plan into a series of MapReduce jobs.
 Finally, the MapReduce jobs are submitted to Hadoop in sorted order for execution to produce the desired results.
Data Model
 Atom  Tuple
 Any single value - irrespective of  A record that is formed by an
their data type - Atom. ordered set of fields is known as a
tuple, the fields can be of any
 It is stored as string and can be
type.
used as string and number.
 A tuple is similar to a row in a
 int, long, float, double, chararray,
table of RDBMS.
and bytearray are the atomic
values of Pig.  Example: (Raja, 30)
 A piece of data or a simple
atomic value is known as a field.
 Example: ‘raja’ or ‘30’
 Bag  Relation
 Unordered set of tuples or a  A bag of tuples.
collection of tuples
 Unordered - No guarantee that
 Tuple can have any number of tuples are processed in any
fields (flexible schema). particular order.
 Represented by ‘{}’.  Map
 Similar to a table in RDBMS  A map (or data map) is a set of
 Not necessary - tuples contain the key-value pairs.
same number of fields and have  The key needs to be of type
the same type. chararray and should be unique.
 Example:  The value might be of any type. It
{(Raja, 30), (Mohammad, 45)} is represented by ‘[]’

 Can be a field in a relation - inner  Example: [name#Raja, age#30]


bag.
 Example: {Raja, 30, {9848022338,
[email protected],}}
Execution Modes

 Local Mode
 Runs from your local host and local file system
 Used for testing purposes
 MapReduce Mode
 Loads or processes data that exists in the Hadoop Distributed File System (HDFS)
 A MapReduce job is invoked in the back-end to perform a particular operation on the data
Execution Mechanisms

Interactive Mode (Grunt shell)


Batch Mode (Script)
Embedded Mode (UDF)
Defining our own functions (User Defined Functions) in
programming languages such as Java, and using
them in our script.
Invoking the Grunt Shell

$ ./pig -x local
$ ./pig -x mapreduce
Either of these commands gives you the Grunt shell
prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
Batch Mode

Write an entire Pig Latin script in a file and execute it using the -x option.

$ pig -x local Sample_script.pig

$ pig -x mapreduce Sample_script.pig


Shell & Utility commands

sh Command
Invoke any shell commands
grunt> sh shell_command parameters
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Invoke any Hadoop File system Shell commands
grunt> fs File System command parameters
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data
Utility Commands
 clear : clear the screen
 grunt> clear
 help : Provides help about the commands.
 history : Displays a list of statements executed/used so far since the Grunt shell was invoked.
 set : Used to show/assign values to keys used in Pig.
 quit : You can quit from the Grunt shell.
 exec/run : Execute Pig scripts
 grunt> exec [-param param_name = param_value] [-param_file file_name] script
 kill : Kill a job from the Grunt shell: grunt> kill JobId
Pig Latin
 A Relation is the outermost structure of the Pig Latin data model. It is a bag where –
 A bag is a collection of tuples.
 A tuple is an ordered set of fields.
 A field is a piece of data.
 Processing Data:
 Statements are the basic constructs
 Statements work with relations
 Statements include operators, expressions and schemas
 Statements take a relation as input and produce another relation as
output except load and store
Student_data = LOAD 'student_data.txt'
    USING PigStorage(',') AS (id:int, firstname:chararray,
    lastname:chararray, phone:chararray, city:chararray);
 Values for all the data types can be NULL, and Pig treats null values in a similar way as SQL does
Operators

Arithmetic: +, -, *, /, %
Bincond operator (?:): b = (a == 1) ? 20 : 30;
CASE WHEN THEN ELSE END:
CASE f2 % 2
    WHEN 0 THEN 'even'
    WHEN 1 THEN 'odd'
END

Comparison: ==, !=, >, <, >=, <=
matches (pattern matching): f1 matches '.*tutorial.*'

Type construction:
Tuple construction operator () : (Raju, 30)
Bag construction operator {} : {(Raju, 30), (Mohammad, 45)}
Map construction operator [] : [name#Raja, age#30]
Relational operators
Preparing Data
 In MapReduce mode, Pig reads (loads) data from HDFS and stores the
results back in HDFS. Therefore, let us start HDFS and create the following
sample data in HDFS.
Load Operator
 The load statement consists of two parts divided by the “=” operator.
 On the left-hand side, we need to mention the name of the relation where
we want to store the data, and on the right-hand side, we have to define
how we store the data.
 Given below is the syntax of the Load operator.
 Relation_name = LOAD 'Input file path' USING function as schema;

Component Description
Relation_name The relation in which we want to store the data.
Input file path Mention the HDFS directory where the file is stored
function A function from the set of load functions provided by
Apache Pig (BinStorage, JsonLoader, PigStorage,
TextLoader).
schema Define the schema of the data
 We can define the required schema as follows
 (column1 : data type, column2 : data type, column3 : data type);
 Note: We can also load the data without specifying the schema. In that case, the columns will be addressed as $0, $1, etc.
 grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
 The PigStorage() function:
 It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, the delimiter is ‘\t’.
Store operator

 STORE Relation_name INTO 'required_directory_path' [USING function];
 Ex:
 grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Diagnostic Operators
 Dump Operator
The Dump operator is used to run the Pig Latin statements and
display the results on the screen. It is generally used for
debugging purposes.
 grunt> Dump Relation_Name;
Ex: Dump student;
 Once you execute the above Pig Latin statement, it will start a
MapReduce job to read data from HDFS.
 Describe: Used to view the schema of a relation
 grunt> Describe Relation_name
 Ex: grunt> describe student;
Output: student: { id: int,firstname: chararray,lastname: chararray,phone:
chararray, city: chararray }
 Explain: Used to display the logical, physical, and MapReduce
execution plans of a relation.
 grunt> explain Relation_name;
 Ex: grunt> explain student;
 Illustrate: Gives you the step-by-step execution of a sequence of
statements
 grunt> illustrate Relation_name;
 grunt> illustrate student;
Group Operator
 The group operator is used to group the data in one or more relations. It
collects the data having the same key.
 Group_data = GROUP Relation_name BY age;
 grunt> group_data = GROUP student_details by age;
 grunt> Dump group_data;
 The output contains two columns: one is age – with which we have grouped the
relation, and the other is Bag – Which contains group of tuples, student records
with the respective age.
 You can see the schema of the table after grouping the data using the describe command.
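As a sketch, grouping by age looks like this (the file path and schema here are assumed for illustration):

```pig
-- Assumed sample relation; file name and field list are illustrative
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
    USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int, city:chararray);

-- Collect tuples that share the same age value into one group each
group_data = GROUP student_details BY age;

-- The resulting schema has two parts: the group key and a bag of matching tuples
-- grunt> describe group_data;
-- group_data: {group: int, student_details: {(id: int, firstname: chararray,
--              lastname: chararray, age: int, city: chararray)}}
```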
Cogroup Operator
 The group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.
The cogroup operator groups the tuples from each relation according to age, where each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21, and it contains two bags –
the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
If a relation doesn’t have any tuples with the age value 21, it returns an empty bag.
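A minimal sketch, assuming student_details and employee_details relations that both carry an age field:

```pig
-- Cogroup two relations on the age field (relation names are assumptions)
cogroup_data = COGROUP student_details BY age, employee_details BY age;

-- Each output tuple holds an age value plus one bag per input relation
DUMP cogroup_data;
```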
Join Operator
 The join operator is used to combine records from two or more relations.
 While performing a join operation, we declare one (or a group of) field(s) from each relation as keys.
 When these keys match, the two particular tuples are matched; otherwise the records are dropped.
 Joins can be of the following types:
 Self-join
 Inner-join
 Outer-join : left join, right join, and full join
 Self-join is used to join a table with itself as if the table
were two relations, temporarily renaming at least one
relation.
 Generally, in Apache Pig, to perform self-join, we will
load the same data multiple times, under different
aliases (names).
Outer Join
 Returns all the rows from at least one of the relations. An outer join
operation is carried out in three ways – Left, Right, and Full.
 left outer Join operation returns all rows from the left table, even if there are
no matches in the right relation.
 right outer join operation returns all rows from the right table, even if there
are no matches in the left table.
 full outer join operation returns rows when there is a match in one of the
relations.
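The join variants above can be sketched as follows, assuming customers and orders relations keyed on a customer id (all names here are illustrative):

```pig
customers = LOAD 'customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, city:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',')
    AS (oid:int, order_date:chararray, customer_id:int, amount:int);

-- Inner join: keep only customers that have matching orders
inner_data = JOIN customers BY id, orders BY customer_id;

-- Left outer join: keep every customer, with or without orders
left_data  = JOIN customers BY id LEFT OUTER, orders BY customer_id;

-- Right outer join: keep every order, with or without a matching customer
right_data = JOIN customers BY id RIGHT OUTER, orders BY customer_id;

-- Full outer join: keep rows that match in either relation
full_data  = JOIN customers BY id FULL OUTER, orders BY customer_id;
```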
Cross Operator
 Computes the cross-product of two or more relations.
Combining and Splitting
 Union Operator :
 The UNION operator of Pig Latin is used to merge the content of two relations.
 To perform UNION operation on two relations, their columns and domains must
be identical.
 Split : Used to split a relation into two or more relations.
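A sketch of both operators (relation and field names are assumed for illustration):

```pig
-- SPLIT routes each tuple into one or more output relations by condition
SPLIT student_details INTO
    minor_students IF age < 18,
    adult_students IF age >= 18;

-- UNION merges two relations whose columns and domains are identical
all_students = UNION minor_students, adult_students;
```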
Filter Operator
 Used to select the required tuples from a relation based on a condition.
Distinct Operator
 Used to remove redundant (duplicate) tuples from a relation
Foreach Operator
 Used to generate specified data transformations based on the column
data.
Order By
 Used to display the contents of a relation in a sorted order based on one or
more fields.
Limit Operator
 Used to get a limited number of tuples from a relation
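One sketch covering these five operators, assuming a student_details relation with id, firstname, age, and city fields:

```pig
-- FILTER: select tuples matching a condition
chennai_students = FILTER student_details BY city == 'Chennai';

-- DISTINCT: remove duplicate tuples
unique_students = DISTINCT student_details;

-- FOREACH: project/transform selected columns
names = FOREACH student_details GENERATE id, firstname, city;

-- ORDER BY: sort on one or more fields
by_age = ORDER student_details BY age DESC;

-- LIMIT: take only the first few tuples
first_four = LIMIT by_age 4;
```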
Built-in Functions – EVAL Functions
 AVG: Used to compute the average of the numerical values within a bag
and ignores the NULL values.
 To get the global average value, we need to perform a Group All operation,
and calculate the average value using the AVG function.
 To get the average value of a group, we need to group it using the Group By
operator and proceed with the average function.
 Max - Used to calculate the highest value for a column (numeric values or
chararrays) in a single-column bag and ignores the NULL values.
 Count:
 Used to get the number of elements in a bag.
 While counting the number of tuples in a bag, the COUNT() function ignores (will not count) tuples having a NULL value in the first field.
 COUNT_STAR
• It includes the NULL values.
 Sum: to get the total of the numeric values of a column in a single-column
bag and ignores the null values.
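Putting these together, a sketch of computing global aggregates with Group All (relation and field names assumed):

```pig
-- Group All puts the whole relation into a single group
student_all = GROUP student_details ALL;

-- Apply the eval functions to the age column of the grouped bag
stats = FOREACH student_all GENERATE
    AVG(student_details.age),
    MAX(student_details.age),
    SUM(student_details.age),
    COUNT(student_details.age),       -- skips tuples with a NULL first field
    COUNT_STAR(student_details.age);  -- includes NULL values
```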
 DIFF:
 Used to compare two bags (fields) in a tuple.
 It takes two fields of a tuple as input and matches them.
 If they match, it returns an empty bag.
 If they do not match, it finds the elements that exist in one field (bag) but not in the other, and returns these elements wrapped in a bag.
 Generally, the DIFF() function compares two bags in a tuple.
 SUBTRACT :
 Used to subtract two bags.
 It takes two bags as inputs and returns a bag which contains the tuples of the first
bag that are not in the second bag.
 IsEmpty : Used to check if a bag or map is empty.
 Size : Used to compute the number of elements based on any Pig data
type.
 BagToString :
 Used to concatenate the elements of a bag into a string.
 While concatenating, we can place a delimiter between these values (optional).
 Concat : Used to concatenate two or more expressions of the same type.
 Tokenize :
 Used to split a string (which contains a group of words) in a single tuple and
return a bag which contains the output of the split operation.
 As a delimiter to the tokenize function, we can pass space [ ], double quote [" "], comma [ , ], parenthesis [ () ], star [ * ].
 Word Count Example:
 lines = LOAD 'data' AS (line:chararray);
 words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
 grouped = GROUP words BY word;
 wordcount = FOREACH grouped GENERATE group, COUNT(words);
 DUMP wordcount;
Load and Store functions
 Used to determine how the data goes into and comes out of Pig.
 TextLoader() is used to load unstructured text data; it can’t be used for the store operation.
 BinStorage() in Pig is generally used to store temporary data generated between MapReduce jobs.
 Handling Compression: Compressed files can be read using the PigStorage and TextLoader functions.
Bag and Tuple Functions
 TOBAG :
 Converts one or more expressions to individual tuples. And these tuples are
placed in a bag.
 TOP : Used to get the top N tuples of a bag.
 To this function, as inputs, we have to pass a relation, the number of tuples we
want, and the column name whose values are being compared.
 This function will return a bag containing the required columns.
 TOTUPLE : Used to convert one or more expressions to the data type tuple.
 TOMAP : Used to convert key-value pairs into a Map.
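A sketch of these conversion functions, using the assumed student_details schema (id, firstname, lastname, age, city):

```pig
-- Convert expressions into a bag, a tuple, and a map respectively
converted = FOREACH student_details GENERATE
    TOBAG(id, firstname),          -- one tuple per expression, placed in a bag
    TOTUPLE(id, firstname, city),  -- (id, firstname, city)
    TOMAP(firstname, age);         -- [firstname#age]

-- TOP(n, column, bag): top-2 tuples per group, compared on column 3
-- (age, in the assumed schema)
grouped = GROUP student_details BY city;
top_two = FOREACH grouped GENERATE TOP(2, 3, student_details);
```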
String Functions
Operator Description
ENDSWITH ENDSWITH(string, testAgainst)
To verify whether a given string ends with a particular
substring
STARTSWITH STARTSWITH(string, substring)
Accepts two string parameters and verifies whether the
first string starts with the second.
SUBSTRING SUBSTRING(string, startIndex, stopIndex)
Returns a substring from a given string.

EqualsIgnoreCase EqualsIgnoreCase(string1, string2)
To compare two strings, ignoring the case.

INDEXOF INDEXOF(string, ‘character’, startIndex)


Returns the first occurrence of a character in a string,
searching forward from a start index.
Operator Description
LAST_INDEX_OF LAST_INDEX_OF(string, ‘character’)
Returns the index of the last occurrence of a character in a
string, searching backward from the end.
LCFIRST /UCFIRST LCFIRST(expression) /UCFIRST(expression)
Converts the first character in a string to lower case /Upper
case.
REPLACE REPLACE(string, ‘oldChar’, ‘newChar’);
To replace existing characters in a string with new characters.

UPPER / LOWER UPPER(expression) / LOWER(expression)


Returns a string converted to upper/lower case.
STRSPLIT STRSPLIT(string, regex, limit)
To split a string around matches of a given regular expression.
SPLITTOBAG SPLITTOBAG(string, regex, limit)
Similar to the STRSPLIT() function, it splits the string by given
delimiter and returns the result in a bag.
TRIM /LTRIM/RTRIM TRIM(expression) /LTRIM(expression)/RTRIM(expression) Returns
a copy of a string with leading and trailing / leading/ trailing
whitespaces removed.
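A few of these string functions in one sketch (field names assumed):

```pig
str_demo = FOREACH student_details GENERATE
    STARTSWITH(firstname, 'Ra'),         -- boolean result
    SUBSTRING(firstname, 0, 2),          -- characters at index 0 and 1
    UPPER(city),                         -- upper-cased copy
    REPLACE(city, 'Chennai', 'Madras'),  -- substitute matching text
    TRIM(lastname);                      -- strip surrounding whitespace
```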
Date and Time functions
 ToDate : Used to generate a DateTime object according to the given
parameters.
 ToDate(milliseconds)
 ToDate(userstring, format)
 ToDate(userstring, format, timezone)
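For instance, assuming a relation whose date column holds strings like 1989/09/26 09:00:00:

```pig
-- Parse a chararray into a DateTime object using a format pattern
todate_data = FOREACH date_data GENERATE
    ToDate(date, 'yyyy/MM/dd HH:mm:ss') AS converted_date;
```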
Math functions
 ABS, ACOS, ATAN, ASIN, CBRT, CEIL, COS, COSH, EXP, FLOOR, LOG, LOG10,
RANDOM, ROUND, SIN, SINH, SQRT, TAN, TANH
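As an illustration (the input relation and its double column are assumed):

```pig
-- Apply a few math functions to a double column named data
math_demo = FOREACH math_data GENERATE
    ABS(data), CEIL(data), FLOOR(data), ROUND(data), SQRT(data);
```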
Running Scripts
 How do we run Apache Pig scripts in batch mode?
 Comments in Pig Script :
 /* */ : multi-line comment
 -- : single-line comment
 Executing Pig Script in Batch mode
Step 1
Write all the required Pig Latin statements in a single file. We can write all the
Pig Latin statements and commands in a single file and save it as .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the Linux shell, e.g. $ pig -x local Sample_script.pig
 You can execute it from the Grunt shell as well using the exec command as
shown below.
grunt> exec /sample_script.pig
 Executing a Pig Script from HDFS :
 Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory
named /pig_data/.
 $ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
