BDH Exp-4 I232

The document provides information about an Apache Pig hands-on lab experiment. It discusses the aim of the experiment to use Apache Pig to find the most occurred starting letter in a data file. It then provides steps to load the data, tokenize it, extract the first letter of each word, group by letter, count occurrences, order by count, limit to the highest count, and store the result. The document also provides background information on Apache Pig, its data types, and how to create Pig UDFs (user defined functions) with examples in Java.


Name: Namra Shah

Class: MBATech (IT)


Roll Number: I-232
SAP ID: 70411119035
Batch: B

EXP4

Aim: Apache Pig hands on lab


Prerequisites:
Java is the primary requirement for running Hadoop on any system, so make sure Java is
installed on your system before starting.

Theory:

What is Apache Pig


Apache Pig is a high-level data flow platform for executing MapReduce programs on
Hadoop. The language used for Pig is Pig Latin.

Pig scripts are internally converted to MapReduce jobs and executed on data stored
in HDFS. Apart from that, Pig can also execute its jobs in Apache Tez or Apache Spark.

Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the
corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be
achieved using Pig can also be achieved by writing MapReduce code directly in Java.

Differences between Apache MapReduce and Apache Pig

Apache MapReduce:

o It is a low-level data processing tool.
o It requires developing complex programs using Java or Python.
o It is difficult to perform data operations in MapReduce.
o It doesn't allow nested data types.

Apache Pig:

o It is a high-level data flow tool.
o It is not required to develop complex programs.
o It provides built-in operators to perform data operations like union, sorting and ordering.
o It provides nested data types like tuple, bag, and map.
Pig Data Types
Apache Pig supports many data types. A list of Apache Pig data types with descriptions and
examples is given below.

Type        Description                  Example

int         Signed 32-bit integer        2

long        Signed 64-bit integer        15L or 15l

float       32-bit floating point        2.5f or 2.5F

double      64-bit floating point        1.5 or 1.5e2 or 1.5E2

chararray   Character array (string)     hello javatpoint

bytearray   BLOB (byte array)

tuple       Ordered set of fields        (12,43)

bag         Collection of tuples         {(12,43),(54,28)}

map         Set of key-value pairs       [open#apache]
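As a sketch, the complex types above can appear in a LOAD schema (the file name and field names below are hypothetical):

```pig
-- Hypothetical schema showing tuple, bag, and map fields
A = LOAD 'complex_data' AS (t:tuple(a:int, b:int),
                            bg:bag{tup:tuple(x:int, y:int)},
                            m:map[]);
DESCRIBE A;  -- prints the schema of relation A
```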

Pig Example
Use case: Using Pig find the most occurred start letter.

Solution:

Step 1: Load the data into a bag named "lines". Each entire line is bound to the field "line" of type
chararray. Note that Pig string literals, such as the path, use single quotes.

grunt> lines = LOAD '/user/Desktop/data.txt' AS (line:chararray);

Step 2: The text in the bag lines needs to be tokenized; this produces one word per row.

grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
Step 3: To retain the first letter of each word, type the command below. It uses the
SUBSTRING function to take the first character of each token.

grunt> letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;

Step 4: Group the letters so that each group contains every occurrence of one
character.

grunt> lettergrp = GROUP letters BY letter;

Step 5: The number of occurrences is counted in each group.

grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);

Step 6: Arrange the output in descending order of count using the command
below.

grunt> OrderCnt = ORDER countletter BY $1 DESC;

Step 7: Limit the output to one row to get the most frequent starting letter.

grunt> result = LIMIT OrderCnt 1;

Step 8: Store the result in HDFS. The result is saved in the output directory under the sonoo folder.

grunt> STORE result INTO '/home/sonoo/output';
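Putting the steps together, the whole job can be sketched as a single Pig Latin script (using the same input and output paths assumed above):

```pig
-- Find the most frequently occurring starting letter in data.txt
lines       = LOAD '/user/Desktop/data.txt' AS (line:chararray);
tokens      = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
letters     = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;
lettergrp   = GROUP letters BY letter;
countletter = FOREACH lettergrp GENERATE group, COUNT(letters);
OrderCnt    = ORDER countletter BY $1 DESC;
result      = LIMIT OrderCnt 1;
STORE result INTO '/home/sonoo/output';
```

Saving this as a .pig file and running it with the pig command executes all eight steps as one job.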

Pig UDF (User Defined Functions)


To specify custom processing, Pig provides support for user-defined functions (UDFs). Thus,
Pig allows us to create our own functions. Currently, Pig UDFs can be implemented in the
following programming languages:

o Java
o Python
o Jython
o JavaScript
o Ruby
o Groovy
Among all the languages, Pig provides the most extensive support for Java functions.
However, limited support is provided to languages like Python, Jython, JavaScript, Ruby, and
Groovy.

Example of Pig UDF


In Pig,

o All UDFs must extend "org.apache.pig.EvalFunc"


o All functions must override the "exec" method.

Let's see an example of a simple EVAL Function to convert the provided string to uppercase.


TestUpper.java

package com.hadoop;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TestUpper extends EvalFunc<String> {
    // Converts the first field of the input tuple to uppercase
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}

o Create the jar file and export it into the specific directory. For that, right click on the
project - Export - Java - JAR file - Next.
o Now, provide a specific name to the jar file and save it in a local system directory.
o Create a text file in your local machine and insert the list of tuples.

1. $ nano pigsample  

o Upload the text file to HDFS in the specific directory.

$ hdfs dfs -put pigsample /pigsample

o Create a pig file in your local machine and write the script.

1. $ nano pscript.pig  

o Now, run the script in the terminal to get the output.

$ pig pscript.pig
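As a sketch, pscript.pig might register the exported jar and apply the UDF; the jar path, HDFS input path, and field name below are assumptions, not values given in this document:

```pig
-- Hypothetical pscript.pig: register the UDF jar and apply TestUpper
REGISTER '/home/user/upper.jar';
A = LOAD '/pigsample' AS (name:chararray);
B = FOREACH A GENERATE com.hadoop.TestUpper(name);
DUMP B;
```

DUMP prints each input string converted to uppercase by the UDF.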
Web link: Pig Tutorial - javatpoint

Execute Pig commands/program and paste a screenshot of the command and output here.

WORD COUNT PROGRAM
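As a starting point for this part, a word-count job in Pig Latin can be sketched as follows (the input path is an assumption):

```pig
-- Hypothetical word-count script: count occurrences of each word
lines  = LOAD '/user/Desktop/data.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word:chararray;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
```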
