
Programming in Hadoop with Pig and Hive


Hadoop Review

• Hadoop is an open-source reimplementation of
  – A distributed file system
  – A map-reduce processing framework
  – A big-table data model
• Inspired by Google's own descriptions of the
  technologies underpinning its search engine
• It is a layer cake of APIs, written mostly in Java,
  that one can use to write large, distributed, and
  scalable applications that search and process
  large datasets
Hadoop Layer Cake

While Hadoop has many advantages, it is not intuitive to translate every
data exploration/manipulation/searching task into a series of map-reduce
operations.

Higher-level languages were needed.

[Diagram: the Hadoop "layer cake", top to bottom]
  Pig (data flow)   |   Hive (SQL emulation)
  MapReduce (job scheduling and shuffling)
  HBase (key-value store)
  HDFS (Hadoop Distributed File System)
High-level approaches for specifying Hadoop jobs

• Pig – A scripting language for transforming big data
  o Useful for "cleaning" and "normalizing" data
  o Three parts:
     Pig Latin – the scripting language
     Grunt – an interactive shell
     Piggybank – a repository of Pig extensions
  o Deferred execution model

• Hive – A SQL-inspired, query-oriented language
  o Imposes structure, in the form of schemas,
    on Hadoop data
  o Creates "data warehouse" layers
Pig Latin’s data model

• Pig – A dataflow scripting language
  o Automatically translated to a series of
    Map-Reduce jobs that are run on Hadoop
  o Requires no metadata or schema
  o Extensible via user-defined functions (UDFs)
    written in Java or other languages
    (C, Python, etc.); see the sketch after this list
  o Provides run-time and debugging environments
  o A language specifically designed for data
    manipulation and analysis
     Supports join, sort, filter, etc.
     Automatically partitions large operations
      into smaller jobs and chains them together
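
A minimal sketch of UDF use: REGISTER and DEFINE are standard Pig Latin, but the jar, class, and data file below are hypothetical.

grunt> REGISTER myudfs.jar;                               -- make the UDF jar visible to Pig
grunt> DEFINE ToCelsius com.example.pig.ToCelsius();      -- bind a short alias to the Java class
grunt> temps = LOAD 'temps.csv' USING PigStorage(',') AS (city:chararray, tempF:double);
grunt> converted = FOREACH temps GENERATE city, ToCelsius(tempF) AS tempC;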
Pig Latin scripts describe dataflows

• Every Pig Latin script describes one or more flows of
  data through a series of operations that can be
  processed in parallel (i.e., the next operation can start
  before the ones providing its inputs finish)
• Dataflows are Directed Acyclic Graphs (DAGs)
• Ordering and scheduling are deferred until a node
  generates data

[Diagram: two example dataflow DAGs. In one, a Load feeds a
Foreach, which joins with a second Load, followed by Sort and
Store; in the other, two Loads feed a Join followed by Store.]


Pig Latin Processing

• Pig Latin scripts are processed line by line
  o Syntax and references are checked
  o Valid statements are added to a logical plan
  o Execution is deferred until either a DUMP or
    STORE statement is reached
  o Reused intermediate results are mapped to
    a common node

grunt> Monkepo = LOAD 'monkepo.csv';
grunt> WetOnes = FILTER Monkepo BY $1 == 'Wet' OR $2 == 'Wet';
grunt> DUMP WetOnes;
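
The deferred plan can be inspected before anything runs: Pig's EXPLAIN command (standard Pig Latin) prints the logical, physical, and map-reduce plans for an alias without executing the job.

grunt> EXPLAIN WetOnes;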
Pig Relations

• Pig variables are bags of tuples
  o Fields – data items
  o Tuples – vectors of fields
  o Bags – collections of unordered tuples
     Unlike relations in relational databases,
      the tuples in a Pig bag need not have the
      same number of fields or the same types
     Pig also supports Maps
  o Maps – dictionaries of name-value pairs

[Diagram: a Bag containing several Tuples; each Tuple is a
vector of fields, Field 0, Field 1, Field 2, …, Field N]
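
A minimal sketch of a map-typed field (the file and key names are hypothetical; # is Pig's map-lookup operator):

users = LOAD 'users.txt' AS (name:chararray, props:map[]);
-- look up the 'city' entry of each user's property map
cities = FOREACH users GENERATE name, props#'city' AS city;
DUMP cities;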
Pig Latin Examples

• Pig scripts are easy to read

Monkepo = LOAD 'monkepo.csv' USING PigStorage(',') AS
  (name:chararray, majorclass:chararray, minorclass:chararray,
   latitude:double, longitude:double, date:datetime);
WetOnes = FILTER Monkepo BY (majorclass == 'Wet' OR minorclass == 'Wet')
  AND date >= ToDate('2016-11-23') AND date <= ToDate('2016-11-30');
Groups = GROUP WetOnes BY name;
STORE Groups INTO 'WetTypes/';

• FOREACH specifies processing steps for all
  tuples in a bag

e1 = LOAD 'input/Employees' USING PigStorage(',') AS
  (name:chararray, age:int, zip:int, salary:double);
f = FOREACH e1 GENERATE age, salary;
DESCRIBE f;
DUMP f;
More Pig Latin Examples

• ORDER
emp = LOAD 'input/Employees' USING PigStorage(',') AS
  (name:chararray, age:int, zip:int, salary:double);
sorted = ORDER emp BY salary;

• LIMIT
emp = LOAD 'input/Employees' USING PigStorage(',') AS
  (name:chararray, age:int, zip:int, salary:double);
agegroup = GROUP emp BY age;
shortlist = LIMIT agegroup 100;

• JOIN
emp = LOAD 'input/Employees' USING PigStorage(',') AS
  (name:chararray, age:int, zip:int, salary:double);
pbk = LOAD 'input/Phonebook' USING PigStorage(',') AS
  (name:chararray, phone:chararray);
contact = JOIN emp BY name, pbk BY name;
DESCRIBE contact;
DUMP contact;
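
The GROUP example above stops short of aggregation; as a hedged extension (not in the original slides), a FOREACH over the grouped relation computes per-group aggregates with Pig's built-in COUNT and AVG:

emp = LOAD 'input/Employees' USING PigStorage(',') AS
  (name:chararray, age:int, zip:int, salary:double);
agegroup = GROUP emp BY age;
-- one tuple per age: the age, the number of employees, and their average salary
stats = FOREACH agegroup GENERATE group AS age, COUNT(emp), AVG(emp.salary);
DUMP stats;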
Hive Query Language

• Hive is an alternative/complement to Pig
  o Hive is a SQL-like query language
  o It imposes "structure" on "unstructured" data
  o Needs a predefined schema definition
  o It is also extensible, via user-defined functions (UDFs)
    written in Java or other languages (C, Python, etc.);
    see the sketch after this list
• Hive does NOT make Hadoop a relational database
  o No transactions, no isolation, no consistency promises
  o Searches and processes Hadoop data stores
  o Not suitable for real-time queries or row-level updates
  o Generally much higher latency than a DBMS,
    but much higher throughput on large batch jobs
  o Best for batch jobs over large "immutable" data
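
A hedged sketch of Hive's UDF mechanism: ADD JAR and CREATE TEMPORARY FUNCTION are standard HiveQL, but the jar, class, and table below are hypothetical.

hive> ADD JAR myudfs.jar;
hive> CREATE TEMPORARY FUNCTION to_celsius AS 'com.example.hive.ToCelsius';
hive> SELECT name, to_celsius(tempF) FROM readings;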
Hive Query Language

• Hive is best used to perform analyses and
  summaries over large data sets
• Hive requires a metastore to keep information
  about its virtual tables
• It evaluates candidate query plans, selects the most
  promising one, and then executes it as a series of
  map-reduce jobs
• Hive is best used to answer a single instance of a
  specific question, whereas Pig is best used to
  accomplish frequent reorganization, combining,
  and reformatting tasks
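
The chosen plan can be inspected with Hive's EXPLAIN statement (standard HiveQL), which prints the plan's map-reduce stages without running the query; the table here is the Monkepo table defined later:

hive> EXPLAIN SELECT name, COUNT(*) FROM Monkepo GROUP BY name;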
Hive Query Language

• Hive is similar to SQL-92
• Based on familiar database concepts: tables, rows,
  columns, and schemas
• Makes "Big Data" appear as tables on the fly
• Like Pig, Hive has a command-line shell
$ hive
hive>

• Or it can execute scripts
$ hive -f myquery.hive

• There are also WIMP (graphical) interfaces
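
The shell can also run a single query passed on the command line with the -e flag (a standard hive option); the table name reuses the example defined next:

$ hive -e 'SELECT * FROM Monkepo LIMIT 10;'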


Defining Hive Tables

• A Hive table consists of
  o Data, linked to one or more files in HDFS
  o A schema, stored as a mapping of the data to a
    set of typed columns
• Schema and data are separated
  o Allows multiple schemas on the same data

$ hive
hive> CREATE TABLE Monkepo (
        name string,
        majorclass string,
        minorclass string,
        latitude double,
        longitude double,
        date timestamp)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
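
Because schema and data are separate, a second schema can be pointed at the same files. A minimal sketch using CREATE EXTERNAL TABLE with a LOCATION clause (standard HiveQL; the path and table name are hypothetical), which reads data Hive does not own or delete:

hive> CREATE EXTERNAL TABLE MonkepoRaw (line string)
      ROW FORMAT DELIMITED
      STORED AS TEXTFILE
      LOCATION '/project/monkepo/';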
More Operations on Hive Tables

Table Operation       Command Syntax
See current tables    hive> SHOW TABLES;
Check schema          hive> DESCRIBE Monkepo;
Change table name     hive> ALTER TABLE Monkepo
                            RENAME TO Pokemon;
Add a column          hive> ALTER TABLE Monkepo
                            ADD COLUMNS (RealName STRING);
Drop a partition      hive> ALTER TABLE Monkepo
                            DROP PARTITION (Name='PigDye');
Loading Hive Tables

• Use LOAD DATA to import data into a Hive table

$ hive
hive> LOAD DATA LOCAL INPATH 'monkepo.csv'
      INTO TABLE Monkepo;

• No files are modified by Hive; the schema simply
  imposes structure on the file as it is read
• The keyword OVERWRITE replaces previously
  loaded files
• Loading a file creates a "data warehouse"
• The schema is verified as data is queried
• Missing columns are mapped to NULL
• INSERT is used to populate one Hive table from another

hive> LOAD DATA INPATH '/project/monkepo.csv'
hive> OVERWRITE INTO TABLE Monkepo;
hive> INSERT INTO TABLE WetOnes
hive> SELECT * FROM Monkepo
hive> WHERE majorclass='wet'
hive> OR minorclass='wet';
Hive Queries

• SELECT
$ hive
hive> SELECT * FROM WetOnes;

• Supports the following:
  o WHERE clause
  o UNION ALL
  o DISTINCT
  o GROUP BY and HAVING
  o LIMIT
  o JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN,
    FULL OUTER JOIN
• Returned row order is unspecified and may vary
  between calls
• Regular expressions (quoted in backticks) can be used
  in the column specification
$ hive
hive> SELECT name, `.*class`, `l.*ude`
hive> FROM Monkepo;
Hive Query Examples

hive> SELECT * FROM customers;
hive> SELECT COUNT(*) FROM customers;
hive>
hive> SELECT zip, COUNT(*) FROM customers
hive> WHERE orderID > 0
hive> GROUP BY zip;
hive>
hive> SELECT customers.*, orders.*
hive> FROM customers JOIN orders
hive> ON (customers.customerID = orders.customerID);
hive>
hive> SELECT customers.*, orders.*
hive> FROM customers LEFT OUTER JOIN orders
hive> ON (customers.customerID = orders.customerID);

• If you understand SQL, you should be able to follow Hive
• Note: These are queries, not transactions
  o The data's state could change between and
    within queries
Hive Subqueries

• Hive allows subqueries only within FROM clauses

hive> SELECT sid, mid, total FROM
hive> (SELECT sid, mid, refCnt + altCnt AS total
hive> FROM genotype) gtypeTotals
hive> WHERE total > 20;

• Subqueries are generally materialized
  (computed and saved as Hive tables)
• You MUST include a name for the subquery result table
• The columns of a subquery's SELECT list are
  available to the outer query
Sorting in Hive

• Hive supports SORT BY, whose result differs from SQL's ORDER BY
  o Only one Reduce step is applied; each reducer sorts its
    own partition, so results are only partially ordered
  o No need for any intermediate files
  o This allows optimization to a single MapReduce step

• Hive also supports ORDER BY, with multiple fields
  o Produces a "total ordering" of all results
  o Might require multiple MapReduce operations
  o Might materialize several intermediate tables
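
A hedged sketch of the difference, assuming an emp table like the one in the Pig examples:

SELECT name, salary FROM emp SORT BY salary;   -- each reducer's output is sorted; global order not guaranteed
SELECT name, salary FROM emp ORDER BY salary;  -- one totally ordered result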
Summary

• There are two primary "high-level" programming languages for
  Hadoop: Pig and Hive

• Pig is a "scripting language" that excels at specifying a
  processing pipeline that is automatically parallelized into
  Map-Reduce operations
  o Deferred execution allows for optimizations in scheduling
    Map-Reduce operations
  o Good for general data manipulation and cleaning

• Hive is a "query language" that borrows heavily from
  SQL and excels at searching and summarizing data
  o Requires the specification of an "external" schema
  o Often materializes many more intermediate
    results than a DBMS would
