BDH Exp-4 I232

The document provides information about an Apache Pig hands-on lab experiment. It discusses the aim of the experiment to use Apache Pig to find the most occurred starting letter in a data file. It then provides steps to load the data, tokenize it, extract the first letter of each word, group by letter, count occurrences, order by count, limit to the highest count, and store the result. The document also provides background information on Apache Pig, its data types, and how to create Pig UDFs (user defined functions) with examples in Java.


Name: Namra Shah

Class: MBATech (IT)


Roll Number: I-232
SAP ID: 70411119035
Batch: B

EXP4

Aim: Apache Pig hands on lab


Prerequisites:
Java is the primary requirement for running Hadoop on any system, so make sure Java is
installed on your system before starting.

Theory:

What is Apache Pig


Apache Pig is a high-level data flow platform for executing MapReduce programs on
Hadoop. The language used for Pig is Pig Latin.

Pig scripts are internally converted to MapReduce jobs and executed on data stored
in HDFS. Apart from that, Pig can also execute its jobs in Apache Tez or Apache Spark.

Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the
corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be
achieved using Pig can also be achieved by writing MapReduce code directly in Java.

Differences between Apache MapReduce and Apache Pig

Apache MapReduce:

o It is a low-level data processing tool.
o It requires developing complex programs using Java or Python.
o It is difficult to perform data operations in MapReduce.
o It doesn't allow nested data types.

Apache Pig:

o It is a high-level data flow tool.
o It is not required to develop complex programs.
o It provides built-in operators to perform data operations like union, sorting and ordering.
o It provides nested data types like tuple, bag, and map.
Pig Data Types
Apache Pig supports many data types. A list of Apache Pig data types with descriptions and
examples is given below.

Type        Description                  Example

int         Signed 32-bit integer        2

long        Signed 64-bit integer        15L or 15l

float       32-bit floating point        2.5f or 2.5F

double      64-bit floating point        1.5 or 1.5e2 or 1.5E2

chararray   Character array (string)     hello javatpoint

bytearray   BLOB (byte array)

tuple       Ordered set of fields        (12,43)

bag         Collection of tuples         {(12,43),(54,28)}

map         Set of key-value pairs       [open#apache]
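As a sketch, the complex types above can appear in a LOAD schema (the file name and field names below are hypothetical):

```pig
-- Hypothetical schema showing tuple, bag, and map fields
A = LOAD 'complex_data' AS (t:tuple(a:int, b:int),
                            bg:bag{tup:tuple(x:int, y:int)},
                            m:map[]);
DESCRIBE A;  -- prints the schema of relation A
```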

Pig Example
Use case: Using Pig find the most occurred start letter.

Solution:

Step 1: Load the data into a bag named "lines". Each entire line is bound to the field "line" of type
chararray. Note that Pig string literals, such as the path, use single quotes.

grunt> lines = LOAD '/user/Desktop/data.txt' AS (line:chararray);

Step 2: The text in the bag lines needs to be tokenized; this produces one word per row.

grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
Step 3: To retain the first letter of each word, type the command below. It uses the
SUBSTRING function to take the first character of each token.

grunt> letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;

Step 4: Group the letters so that each group contains every occurrence of one
character.

grunt> lettergrp = GROUP letters BY letter;

Step 5: The number of occurrences is counted in each group.

grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);

Step 6: Arrange the output in descending order of count using the command
below.

grunt> OrderCnt = ORDER countletter BY $1 DESC;

Step 7: Limit the output to one row to get the most frequent starting letter.

grunt> result = LIMIT OrderCnt 1;

Step 8: Store the result in HDFS. The result is saved in the output directory under the sonoo folder.

grunt> STORE result INTO '/home/sonoo/output';
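Putting the steps together, the whole job can be sketched as a single Pig Latin script (using the same input and output paths assumed above):

```pig
-- Find the most frequently occurring starting letter in data.txt
lines       = LOAD '/user/Desktop/data.txt' AS (line:chararray);
tokens      = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
letters     = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;
lettergrp   = GROUP letters BY letter;
countletter = FOREACH lettergrp GENERATE group, COUNT(letters);
OrderCnt    = ORDER countletter BY $1 DESC;
result      = LIMIT OrderCnt 1;
STORE result INTO '/home/sonoo/output';
```

Saving this as a .pig file and running it with the pig command executes all eight steps as one job.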

Pig UDF (User Defined Functions)


To specify custom processing, Pig provides support for user-defined functions (UDFs). Thus,
Pig allows us to create our own functions. Currently, Pig UDFs can be implemented in the
following programming languages:

o Java
o Python
o Jython
o JavaScript
o Ruby
o Groovy
Among all the languages, Pig provides the most extensive support for Java functions.
However, limited support is provided to languages like Python, Jython, JavaScript, Ruby, and
Groovy.

Example of Pig UDF


In Pig,

o All UDFs must extend "org.apache.pig.EvalFunc"


o All functions must override the "exec" method.

Let's see an example of a simple EVAL Function to convert the provided string to uppercase.


TestUpper.java

package com.hadoop;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TestUpper extends EvalFunc<String> {
    // Converts the first field of the input tuple to uppercase
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}

o Create the jar file and export it into the specific directory. For that, right click on the
project - Export - Java - JAR file - Next.
o Now, provide a specific name to the jar file and save it in a local system directory.
o Create a text file in your local machine and insert the list of tuples.

1. $ nano pigsample  

o Upload the text file to HDFS in the specific directory.

$ hdfs dfs -put pigsample /pigsample

o Create a pig file in your local machine and write the script.

1. $ nano pscript.pig  

o Now, run the script in the terminal to get the output.

$ pig pscript.pig
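As a sketch, pscript.pig might register the exported jar and apply the UDF; the jar path, HDFS input path, and field name below are assumptions, not values given in this document:

```pig
-- Hypothetical pscript.pig: register the UDF jar and apply TestUpper
REGISTER '/home/user/upper.jar';
A = LOAD '/pigsample' AS (name:chararray);
B = FOREACH A GENERATE com.hadoop.TestUpper(name);
DUMP B;
```

DUMP prints each input string converted to uppercase by the UDF.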
Web link: Pig Tutorial - javatpoint

Execute Pig commands/program and paste a screenshot of the command and output here.

WORD COUNT PROGRAM
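As a starting point for this part, a word-count job in Pig Latin can be sketched as follows (the input path is an assumption):

```pig
-- Hypothetical word-count script: count occurrences of each word
lines  = LOAD '/user/Desktop/data.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word:chararray;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
```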
