BDH Exp-4 I232
EXP4
Theory:
Pig scripts are internally converted to MapReduce jobs and executed on data stored
in HDFS. Apart from that, Pig can also run its jobs on Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the
corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be
achieved using Pig can also be achieved using Java in MapReduce.
MapReduce vs. Pig:
o It is difficult to perform data operations in MapReduce, whereas Pig provides built-in operators for operations like union, sorting and ordering.
o MapReduce doesn't allow nested data types, whereas Pig provides nested data types like tuple, bag, and map.
Pig Data Types
Apache Pig supports many data types: the primitive types int, long, float, double,
chararray, bytearray and boolean, and the complex (nested) types tuple, bag, and map.
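To make the nesting in the complex types concrete, here is an illustrative sketch (plain Java collections, not the Pig API) of how a tuple, a bag, and a map relate to each other; the names are assumptions chosen for this example.

```java
// Illustrative sketch (not the Pig API): Pig's complex types modeled with
// plain Java collections so the nesting is visible. A tuple is an ordered
// set of fields, a bag is a collection of tuples, a map is chararray-keyed.
import java.util.*;

public class PigTypeAnalogs {
    // A Pig tuple such as (1, 'john'): an ordered set of fields of any type.
    public static List<Object> tuple(Object... fields) {
        return Arrays.asList(fields);
    }

    public static void main(String[] args) {
        // bag: {(1,'john'), (2,'jane')} -- an unordered collection of tuples
        List<List<Object>> bag = new ArrayList<>();
        bag.add(tuple(1, "john"));
        bag.add(tuple(2, "jane"));
        // map: ['name'#'john'] -- chararray keys mapped to any Pig type
        Map<String, Object> map = new HashMap<>();
        map.put("name", "john");
        System.out.println(bag + " " + map);
    }
}
```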
Pig Example
Use case: Using Pig, find the most frequently occurring starting letter.
Solution:
Case 1: Load the data into a bag named "lines". Each entire line is bound to the element
line of type character array.

grunt> lines = LOAD '/user/Desktop/data.txt' AS (line:chararray);
Case 2: Tokenize the text in the bag lines. This produces one word per row.

grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
Case 3: Retain the first letter of each word with the command below. It uses the
SUBSTRING function to take the first character of each token.

grunt> letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;
Case 4: Group the letters so that each group's bag contains every occurrence of that
letter.

grunt> lettergrp = GROUP letters BY letter;
Case 5: Count the occurrences of each letter in its group.

grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);
Case 6: Arrange the output according to count in descending order using the command
below.

grunt> OrderCnt = ORDER countletter BY $1 DESC;

Case 7: Limit the output to the topmost entry.

grunt> result = LIMIT OrderCnt 1;
Case 8: Store the result in HDFS. The result is saved in the output directory under the
sonoo folder.

grunt> STORE result INTO '/home/sonoo/output';
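The pipeline above can be sketched in plain Java (no Hadoop required) to show what the Pig script computes end to end: tokenize the lines, take each word's first letter, group and count, then keep the most frequent letter. The method and variable names are illustrative assumptions, not Pig API.

```java
// Plain-Java sketch of the Pig pipeline above: TOKENIZE/FLATTEN, SUBSTRING,
// GROUP + COUNT, ORDER ... DESC, LIMIT 1. Names are illustrative only.
import java.util.*;

public class FirstLetterCount {
    public static Map.Entry<String, Long> mostCommonFirstLetter(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            // TOKENIZE + FLATTEN: one word per row
            for (String token : line.split("\\s+")) {
                if (token.isEmpty()) continue;
                // SUBSTRING(token, 0, 1): the starting letter
                String letter = token.substring(0, 1);
                counts.merge(letter, 1L, Long::sum);   // GROUP ... + COUNT
            }
        }
        // ORDER BY count DESC + LIMIT 1: keep the most frequent letter
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("apple ant bear", "box apple");
        // apple, ant, apple start with 'a'; bear, box start with 'b'
        System.out.println(mostCommonFirstLetter(lines));  // a=3
    }
}
```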
Pig allows user-defined functions (UDFs) to be written in the following languages:

o Java
o Python
o Jython
o JavaScript
o Ruby
o Groovy
Among all these languages, Pig provides the most extensive support for Java functions;
support for Python, Jython, JavaScript, Ruby, and Groovy is more limited.
Let's see an example of a simple EVAL Function to convert the provided string to uppercase.
TestUpper.java
package com.hadoop;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TestUpper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
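The UDF's core behaviour can be checked locally before packaging the jar. The sketch below reproduces the logic of exec() without the Pig dependency; the class and method names are assumptions for this example.

```java
// Local sketch of the UDF's core behaviour, without the Pig dependency:
// a null input yields null, anything else is upper-cased. This mirrors
// the exec() method above so the logic can be verified in isolation.
public class UpperDemo {
    public static String toUpper(String input) {
        if (input == null) return null;     // matches the UDF's null guard
        return input.toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(toUpper("javatpoint"));  // JAVATPOINT
    }
}
```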
o Create the jar file and export it to a specific directory. For that, right-click on the
project, then choose Export, Java, JAR file, and Next.
o Now, provide a specific name to the jar file and save it in a local system directory.
o Create a text file on your local machine and insert the list of tuples.
o Create a text file in your local machine and insert the list of tuples.
$ nano pigsample

o Upload the file to HDFS.

$ hdfs dfs -put pigsample /pigsample
o Create a Pig file on your local machine and write the script.
$ nano pscript.pig

o Run the Pig script.

$ pig pscript.pig
Web link: Pig Tutorial - javatpoint