
BDA – V

Pig: Installing and Running Pig, an Example, Comparison with Databases, Pig Latin,
User Defined Functions, Data Processing Operators; Hive: The Hive Shell, An Example,
Running Hive, Comparison with Traditional Databases, HiveQL, Tables, Querying Data

Installing Pig:
To install Apache Pig:
1. Install Java: Ensure Java is installed (sudo apt install openjdk-8-jdk for Ubuntu).
2. Install Hadoop: Install Hadoop if not already installed.
3. Download Apache Pig: Get the latest release from Apache Pig's website.
4. Extract the Archive: Unzip the downloaded tar.gz file (tar -xvf pig-0.x.x.tar.gz).
5. Set Environment Variables: Add export PIG_HOME=/opt/pig and export PATH=$PATH:$PIG_HOME/bin to ~/.bashrc or ~/.bash_profile.
6. Verify Installation: Run pig to check if it's working.
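For example, assuming the archive was extracted to /opt/pig (the path is illustrative), the environment setup and a quick check might look like this:

# append to ~/.bashrc (assuming Pig was extracted to /opt/pig)
export PIG_HOME=/opt/pig
export PATH=$PATH:$PIG_HOME/bin

# reload the shell configuration and verify the installation
source ~/.bashrc
pig -version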

Running Pig:
To run Pig files:
1. Create a Pig script: Write your Pig script with .pig extension.

2. Run in local mode:


pig -x local your_script.pig

3. Run in Hadoop mode: (If Hadoop is set up)


pig -x mapreduce your_script.pig

Running Pig Programs


There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:

1. Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs
the commands in the local file script.pig. Alternatively, for very short scripts, you can
use the -e option to run a script specified as a string on the command line:
pig -e "your Pig command here"

2. Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no file is
specified for Pig to run, and the -e option is not used. It is also possible to run Pig
scripts from within Grunt using run and exec.
Example: pig
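A short interactive session might look like this (the script name is illustrative):

$ pig -x local
grunt> run script.pig
grunt> exec script.pig
grunt> quit

Note that run executes the script within the current Grunt session, so its aliases remain available afterwards, while exec runs it in a separate, fresh context.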
3. Embedded
You can run Pig programs from Java using the PigServer class, much like you can use
JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
Example:
PigServer pigServer = new PigServer(ExecType.LOCAL);
pigServer.registerQuery("your Pig command");
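As a fuller sketch, a complete embedded program might look like the following (the input file, schema, and class name are assumed for illustration):

import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbedExample {
    public static void main(String[] args) throws IOException {
        // run Pig in local mode (use ExecType.MAPREDUCE for a cluster)
        PigServer pigServer = new PigServer(ExecType.LOCAL);

        // register queries exactly as they would be typed in Grunt
        pigServer.registerQuery(
            "data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);");
        pigServer.registerQuery("adults = FILTER data BY age > 30;");

        // iterate over the tuples produced by the 'adults' alias
        Iterator<Tuple> it = pigServer.openIterator("adults");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}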

Pig Latin:
Pig Latin is a high-level data flow language used in Apache Pig for processing and
analyzing large datasets in Hadoop. It simplifies the process of writing MapReduce
programs by providing a language that's easier to write and read.

• Data Processing Language: Pig Latin allows users to describe data transformations, including filtering, grouping, and joining, without writing complex MapReduce code.
• Easy Syntax: Its syntax is more intuitive than traditional MapReduce, making it easier to write, read, and maintain.
• Data Model: Pig Latin uses a relational data model, working with datasets in the form of bags, tuples, and fields (see the notation sketch below).
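Concretely, a field is a single atomic value, a tuple is an ordered set of fields, and a bag is a collection of tuples. In Pig's notation (the values are illustrative):

field:  25
tuple:  (John, 25)
bag:    {(John, 25), (Mary, 30)}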

Comments
Comments in Pig Latin can be written using:
o Single-line comments: Prefixed with --.
-- This is a single-line comment
o Multi-line comments: Enclosed in /* */.
/* This is a
multi-line comment */

Input and Output
Pig deals with input and output using three main operations:
1. Load: Loads data from the filesystem into Pig for processing.
o Example:
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);
This loads the data from input.txt, using ',' as a delimiter, and defines the schema with fields name and age.

2. Store: Stores the output data back to the filesystem.


o Example:
STORE data INTO 'output.txt' USING PigStorage(',');
This stores the processed data into output.txt.

3. Dump: Prints the output of the data to the console (useful for small datasets).
o Example:
DUMP data;

Relational Operations
Pig provides various relational operations to manipulate data; a combined script follows the list.
1. foreach: Applies a transformation to each tuple in the dataset.
o Example:
transformed_data = FOREACH data GENERATE name, age + 1;
2. Filter: Filters data based on a condition.
o Example:
filtered_data = FILTER data BY age > 30;
3. Group: Groups data by a specific field.
o Example:
grouped_data = GROUP data BY name;
4. Order By: Sorts the data based on one or more fields.
o Example:
ordered_data = ORDER data BY age DESC;
5. Distinct: Removes duplicate tuples from the data.
o Example:
unique_data = DISTINCT data;
6. Join: Joins two datasets based on a common field.
o Example:
joined_data = JOIN data1 BY name, data2 BY name;
7. Limit: Limits the number of tuples returned.
o Example:
limited_data = LIMIT data 10;
8. Sample: Randomly samples a fraction of the data.
o Example:
sampled_data = SAMPLE data 0.1;
This randomly selects roughly 10% of the data.
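Putting several of these operators together, a small end-to-end script might look like this (the file names and fields are illustrative):

-- load (name, age) records from a comma-separated file
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- keep only people older than 30, group them by name, and count each group
adults = FILTER data BY age > 30;
by_name = GROUP adults BY name;
counts = FOREACH by_name GENERATE group AS name, COUNT(adults) AS cnt;

-- sort by count and keep the ten largest groups
ordered = ORDER counts BY cnt DESC;
top10 = LIMIT ordered 10;

STORE top10 INTO 'top_names' USING PigStorage(',');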

User Defined Functions in Pig:
In addition to the built-in functions, Apache Pig provides extensive support for User
Defined Functions (UDFs). Using these UDFs, we can define our own functions and
use them. UDF support is provided in six programming languages, namely Java,
Jython, Python, JavaScript, Ruby and Groovy.
1. Filter Functions UDF
Filter functions are used to filter out specific data from a dataset based on some condition.
• Purpose: To remove or filter records based on certain criteria.
• Common Use: Typically used when you need to apply custom filtering conditions on the data.

Example: Filter Function UDF in Java
Let's say we want to keep only records where the age is above 30. We'll write a custom filter function UDF for this.

1. Java UDF Code (filter function to keep only people older than 30):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class AgeFilterUDF extends EvalFunc<Boolean> {

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        int age = (Integer) input.get(0); // get the age field from the tuple
        return age > 30;                  // true only if age is greater than 30
    }
}

2. Using the Filter Function in Pig Script:

REGISTER 'AgeFilterUDF.jar';
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
filtered_data = FILTER data BY AgeFilterUDF(age);
DUMP filtered_data;

2. Eval Functions UDF
Eval functions are used to evaluate data and transform it, such as calculating new
fields, performing operations, or generating new data based on existing data.
• Purpose: To transform or compute new data from existing fields.
• Common Use: To perform operations like mathematical calculations, string manipulations, etc.
Example: Eval Function UDF in Java
Let’s say we want to create a UDF that calculates the year of birth based on the
current year (assuming the current year is 2025).

1. Java UDF Code (Eval function to calculate the year of birth):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class BirthYearUDF extends EvalFunc<Integer> {

    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        int age = (Integer) input.get(0); // get the age from the tuple
        return 2025 - age;                // subtract age from the current year (2025)
    }
}

2. Using the Eval Function in Pig Script:

REGISTER 'BirthYearUDF.jar';
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
birth_year_data = FOREACH data GENERATE name, age, BirthYearUDF(age) AS birth_year;
DUMP birth_year_data;

3. Loader Functions UDF
Loader functions are used to load data from external sources into Pig. You can use
built-in loaders (e.g., PigStorage) or write your own custom loader if the data format is
unusual or proprietary.
• Purpose: To load data from files or databases into the Pig environment.
• Common Use: Used for reading data in formats that Pig can't load natively.

Example: Loader Function UDF in Java
Let's say we need to write a custom loader to load a proprietary data format where
fields are separated by a special delimiter (e.g., |).
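A minimal sketch of such a loader, assuming Pig's LoadFunc API (the class name PipeDelimitedLoader is hypothetical, and every field is loaded as a plain value with no declared schema):

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// hypothetical loader for a format whose fields are separated by '|'
public class PipeDelimitedLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat(); // read the input one line at a time
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue())
                return null; // end of input
            Text line = (Text) reader.getCurrentValue();
            // '|' is a regex metacharacter, so it must be escaped
            String[] fields = line.toString().split("\\|");
            ArrayList<Object> values = new ArrayList<Object>();
            for (String f : fields)
                values.add(f);
            return tupleFactory.newTuple(values);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

It would then be registered and used like any other loader:
data = LOAD 'records.dat' USING PipeDelimitedLoader();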

Introduction to Apache Hive:


Apache Hive is a data warehousing and SQL-like query language system built on top
of Apache Hadoop. It is primarily designed to handle large-scale data storage and
querying. Hive enables users to run queries on massive datasets stored in Hadoop's
HDFS (Hadoop Distributed File System) using a language called HiveQL, which is
similar to SQL (Structured Query Language).
Hive abstracts the complexity of MapReduce by providing a higher-level interface for
querying and managing large datasets. It is widely used for managing and analyzing
large-scale datasets, particularly in environments with big data.

SQL-Like Query Language (HiveQL):
• Hive provides an interface that is similar to SQL, called HiveQL (or HQL). Users can write SQL-like queries to retrieve, manipulate, and analyze data in a familiar way.
• It supports common SQL operations like SELECT, JOIN, GROUP BY, ORDER BY, INSERT, etc.
Data Warehouse Infrastructure:
• Hive is essentially a data warehouse built on top of Hadoop. It provides tools for querying, summarizing, and analyzing data stored in Hadoop's HDFS.
Scalability:
• Hive is designed to work with large datasets and can handle massive volumes of data, taking advantage of Hadoop's distributed storage and processing capabilities.
Extensibility:
• Hive supports user-defined functions (UDFs), which can be written in Java, to extend its functionality for custom queries and processing, as sketched below.
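As a minimal sketch, a Hive UDF written in Java against the classic org.apache.hadoop.hive.ql.exec.UDF API might look like this (the class and function names are illustrative):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// converts a string to upper case
public class Upper extends UDF {
    public Text evaluate(Text s) {
        if (s == null)
            return null;
        return new Text(s.toString().toUpperCase());
    }
}

After packaging the class into a jar, it would be registered and called from HiveQL:
ADD JAR upper-udf.jar;
CREATE TEMPORARY FUNCTION my_upper AS 'Upper';
SELECT my_upper(name) FROM employees;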

Hive Shell:
The Hive Shell is an interactive command-line interface (CLI) used to interact with
Apache Hive. It provides a convenient way to run HiveQL queries, manage
databases and tables, and perform other operations within the Hive environment. The
Hive Shell is typically used by data analysts, data engineers, and administrators for
querying, managing, and analyzing data stored in Hadoop or HDFS (Hadoop
Distributed File System).

The Hive Shell allows users to execute HiveQL commands directly on the Hadoop
cluster, making it an essential tool for day-to-day interactions with Hive.
Launching the Hive Shell: To start the Hive Shell, you need to have Apache Hive
installed and configured on your machine or cluster. Once installed, you can launch
the Hive Shell by simply typing the following command in the terminal:
Example: hive

Running Hive:
1. Hive Shell: The Hive Shell provides an interactive command-line interface to
run HiveQL queries directly. You can start it by typing hive in the terminal, and it
will allow you to execute queries, create databases, manage tables, and more.
2. Batch Mode (Hive Scripts): Hive supports executing multiple queries stored
in a script file. You can run a script using the -f option, which allows for batch
processing of HiveQL commands.
Example: hive -f my_script.hql
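For illustration, my_script.hql might contain a few HiveQL statements like those covered below:

-- my_script.hql (illustrative)
CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, salary FLOAT);
SELECT name, salary FROM employees WHERE salary > 4000;

Hive can also run a single query passed as a string with the -e option, e.g. hive -e "SELECT * FROM employees;".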

Apache Pig vs MapReduce vs Hive vs SQL: Key Differences


Definition:
• Apache Pig: A high-level platform for processing and analyzing large datasets in Hadoop using Pig Latin.
• MapReduce: A low-level, distributed programming model for processing large datasets in parallel.
• Hive: A data warehouse system built on Hadoop, using HiveQL (similar to SQL) for querying.
• SQL: A standard query language used for managing and manipulating relational databases.

Level of Abstraction:
• Apache Pig: High-level abstraction (using Pig Latin).
• MapReduce: Low-level programming model (Java-based).
• Hive: High-level abstraction (uses HiveQL, similar to SQL).
• SQL: High-level abstraction for querying relational databases.

Programming Model:
• Apache Pig: Data flow language (Pig Latin).
• MapReduce: Map and Reduce functions.
• Hive: Query language (HiveQL, SQL-like syntax).
• SQL: Structured Query Language (SQL) for relational databases.

Ease of Use:
• Apache Pig: Easier to write than MapReduce; more user-friendly with fewer lines of code.
• MapReduce: Requires extensive programming (usually in Java).
• Hive: Easier to use than MapReduce, with SQL-like syntax.
• SQL: Highly abstracted and easy to use for querying relational data.

Performance:
• Apache Pig: Less efficient than MapReduce in some cases, but faster for many operations due to optimization techniques.
• MapReduce: Provides full control over processing but requires manual optimization.
• Hive: Optimized for complex queries and large datasets, but not as low-level as MapReduce.
• SQL: Highly optimized for relational databases but not designed for big data processing.

Use Case:
• Apache Pig: ETL jobs, data processing, batch processing tasks in Hadoop.
• MapReduce: Complex, low-level tasks requiring fine-grained control over data processing.
• Hive: Data warehousing, analytical queries on large datasets.
• SQL: Data manipulation, transactional queries in relational databases.

Execution Engine:
• Apache Pig: Based on MapReduce but abstracts it for easier use.
• MapReduce: Directly executes Map and Reduce jobs on Hadoop clusters.
• Hive: Uses MapReduce or Tez for query execution in the background.
• SQL: The SQL engine in relational database management systems (RDBMS).

Data Model:
• Apache Pig: Works with flat files, text, and structured data stored in Hadoop.
• MapReduce: Works with large datasets processed through key-value pairs.
• Hive: Works with structured data (tables, partitions) in Hadoop.
• SQL: Works with structured data (tables, rows, columns) in relational databases.

Support for Joins:
• Apache Pig: Supports different types of joins (e.g., inner, outer) with ease.
• MapReduce: Joins are complex to implement manually.
• Hive: Easy to write joins with SQL-like syntax in HiveQL.
• SQL: Simple and powerful join operations.

Data Storage:
• Apache Pig: Typically interacts with HDFS (Hadoop Distributed File System).
• MapReduce: Data is processed across HDFS or other distributed file systems.
• Hive: Primarily uses HDFS for data storage.
• SQL: Uses relational databases (MySQL, PostgreSQL, etc.) for data storage.

HiveQL:
HiveQL (Hive Query Language) is the query language used in Apache Hive, a data
warehousing system built on top of Hadoop. It is a SQL-like language that enables
users to perform data manipulation, analysis, and querying of large datasets stored in
HDFS (Hadoop Distributed File System) and other storage systems integrated with
Hadoop. HiveQL allows users to interact with Hadoop through a familiar syntax,
making it easier for people with SQL backgrounds to work with big data.

Data Definition Operations (DDL):


1. CREATE DATABASE:
CREATE DATABASE my_database;

2. USE DATABASE:
USE my_database;

3. CREATE TABLE:
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
);

4. CREATE EXTERNAL TABLE:


CREATE EXTERNAL TABLE employees (
id INT,
name STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/path/to/external/data';
5. DROP DATABASE:
DROP DATABASE my_database CASCADE;
(CASCADE drops the database even if it still contains tables.)

6. DROP TABLE:
DROP TABLE employees;

7. ALTER TABLE:
ALTER TABLE employees ADD COLUMNS (age INT);

Data Manipulation Operations (DML):


1. INSERT INTO:
INSERT INTO employees VALUES (1, 'John Doe', 5000.0);

2. INSERT OVERWRITE:
INSERT OVERWRITE TABLE employees SELECT * FROM temp_employees;

3. LOAD DATA:
LOAD DATA INPATH '/path/to/data.csv' INTO TABLE employees;

4. SELECT:
SELECT * FROM employees WHERE salary > 4000;

Querying Operations:
1. SELECT:
SELECT name, salary FROM employees WHERE salary > 4000;

2. GROUP BY:
SELECT department, AVG(salary) FROM employees GROUP BY department;

3. HAVING:
SELECT department, COUNT(*) FROM employees GROUP BY department HAVING COUNT(*) > 5;

4. ORDER BY:
SELECT * FROM employees ORDER BY salary DESC;

5. LIMIT:
SELECT * FROM employees LIMIT 10;

6. JOIN:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d
ON e.department_id = d.department_id;

7. DISTINCT:
SELECT DISTINCT department FROM employees;
