
BDA – V

Pig: Installing and Running Pig, an Example, Comparison with Databases, Pig Latin,
User Defined Functions, Data Processing Operators; Hive: The Hive Shell, An Example,
Running Hive, Comparison with Traditional Databases, HiveQL, Tables, Querying Data

Installing Pig:
To install Apache Pig:
1. Install Java: Ensure Java is installed (sudo apt install openjdk-8-jdk for Ubuntu).
2. Install Hadoop: Install Hadoop if not already installed.
3. Download Apache Pig: Get the latest release from Apache Pig's website.
4. Extract the Archive: Unzip the downloaded tar.gz file (tar -xvf pig-0.x.x.tar.gz).
5. Set Environment Variables: Add export PIG_HOME=/opt/pig and export PATH=$PATH:$PIG_HOME/bin to ~/.bashrc or ~/.bash_profile.
6. Verify Installation: Run pig to check if it's working.
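For example, assuming the archive was extracted to /opt/pig (the path is illustrative), the environment setup and a quick check might look like this:

# append to ~/.bashrc (assuming Pig was extracted to /opt/pig)
export PIG_HOME=/opt/pig
export PATH=$PATH:$PIG_HOME/bin

# reload the shell configuration and verify the installation
source ~/.bashrc
pig -version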

Running Pig:
To run Pig files:
1. Create a Pig script: Write your Pig script with .pig extension.

2. Run in local mode:


pig -x local your_script.pig

3. Run in Hadoop mode: (If Hadoop is set up)


pig -x mapreduce your_script.pig

Running Pig Programs


There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:

1. Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs
the commands in the local file script.pig. Alternatively, for very short scripts, you can
use the -e option to run a script specified as a string on the command line:
pig -e "your Pig command here"

2. Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no file is
specified for Pig to run, and the -e option is not used. It is also possible to run Pig
scripts from within Grunt using run and exec.
Example: pig
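A short interactive session might look like this (the script name is illustrative):

$ pig -x local
grunt> run script.pig
grunt> exec script.pig
grunt> quit

Note that run executes the script within the current Grunt session, so its aliases remain available afterwards, while exec runs it in a separate, fresh context.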
3. Embedded
You can run Pig programs from Java using the PigServer class, much like you can use
JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
Example:
PigServer pigServer = new PigServer(ExecType.LOCAL);
pigServer.registerQuery("your Pig command");
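As a fuller sketch, a complete embedded program might look like the following (the input file, schema, and class name are assumed for illustration):

import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbedExample {
    public static void main(String[] args) throws IOException {
        // run Pig in local mode (use ExecType.MAPREDUCE for a cluster)
        PigServer pigServer = new PigServer(ExecType.LOCAL);

        // register queries exactly as they would be typed in Grunt
        pigServer.registerQuery(
            "data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);");
        pigServer.registerQuery("adults = FILTER data BY age > 30;");

        // iterate over the tuples produced by the 'adults' alias
        Iterator<Tuple> it = pigServer.openIterator("adults");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}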

Pig Latin:
Pig Latin is a high-level data flow language used in Apache Pig for processing and
analyzing large datasets in Hadoop. It simplifies the process of writing MapReduce
programs by providing a language that's easier to write and read.

• Data Processing Language: Pig Latin allows users to describe data transformations, including filtering, grouping, and joining, without writing complex MapReduce code.
• Easy Syntax: Its syntax is more intuitive than traditional MapReduce, making it easier to write, read, and maintain.
• Data Model: Pig Latin uses a relational data model, working with datasets in the form of bags, tuples, and fields (see the notation sketch below).
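Concretely, a field is a single atomic value, a tuple is an ordered set of fields, and a bag is a collection of tuples. In Pig's notation (the values are illustrative):

field:  25
tuple:  (John, 25)
bag:    {(John, 25), (Mary, 30)}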

Comments
Comments in Pig Latin can be written using:
o Single-line comments: Prefixed with --.
-- This is a single-line comment
o Multi-line comments: Enclosed in /* */.
/* This is a
multi-line comment */

Input and Output
Pig deals with input and output using three main operations:
1. Load: Loads data from the filesystem into Pig for processing.
o Example:
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);
This loads the data from input.txt, using ',' as a delimiter, and defines the schema with fields name and age.

2. Store: Stores the output data back to the filesystem.


o Example:
STORE data INTO 'output.txt' USING PigStorage(',');
This stores the processed data into output.txt.

3. Dump: Prints the output of the data to the console (useful for small datasets).
o Example:
DUMP data;

Relational Operations
Pig provides various relational operations to manipulate data; a combined script follows the list.
1. foreach: Applies a transformation to each tuple in the dataset.
o Example:
transformed_data = FOREACH data GENERATE name, age + 1;
2. Filter: Filters data based on a condition.
o Example:
filtered_data = FILTER data BY age > 30;
3. Group: Groups data by a specific field.
o Example:
grouped_data = GROUP data BY name;
4. Order By: Sorts the data based on one or more fields.
o Example:
ordered_data = ORDER data BY age DESC;
5. Distinct: Removes duplicate tuples from the data.
o Example:
unique_data = DISTINCT data;
6. Join: Joins two datasets based on a common field.
o Example:
joined_data = JOIN data1 BY name, data2 BY name;
7. Limit: Limits the number of tuples returned.
o Example:
limited_data = LIMIT data 10;
8. Sample: Randomly samples a fraction of the data.
o Example:
sampled_data = SAMPLE data 0.1;
This randomly selects roughly 10% of the data.
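Putting several of these operators together, a small end-to-end script might look like this (the file names and fields are illustrative):

-- load (name, age) records from a comma-separated file
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- keep only people older than 30, group them by name, and count each group
adults = FILTER data BY age > 30;
by_name = GROUP adults BY name;
counts = FOREACH by_name GENERATE group AS name, COUNT(adults) AS cnt;

-- sort by count and keep the ten largest groups
ordered = ORDER counts BY cnt DESC;
top10 = LIMIT ordered 10;

STORE top10 INTO 'top_names' USING PigStorage(',');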

User Defined Functions in Pig:
In addition to the built-in functions, Apache Pig provides extensive support for User
Defined Functions (UDFs). Using these UDFs, we can define our own functions and
use them. UDF support is provided in six programming languages, namely Java,
Jython, Python, JavaScript, Ruby and Groovy.
1. Filter Functions UDF
Filter functions are used to filter out specific data from a dataset based on some condition.
• Purpose: To remove or filter records based on certain criteria.
• Common Use: Typically used when you need to apply custom filtering conditions on the data.

Example: Filter Function UDF in Java
Let's say we want to keep only records where the age is above 30. We'll write a custom filter function UDF for this.

1. Java UDF Code (filter function to keep only people older than 30):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class AgeFilterUDF extends EvalFunc<Boolean> {

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        int age = (Integer) input.get(0); // get the age field from the tuple
        return age > 30;                  // true only if age is greater than 30
    }
}

2. Using the Filter Function in Pig Script:

REGISTER 'AgeFilterUDF.jar';
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
filtered_data = FILTER data BY AgeFilterUDF(age);
DUMP filtered_data;

2. Eval Functions UDF
Eval functions are used to evaluate data and transform it, such as calculating new
fields, performing operations, or generating new data based on existing data.
• Purpose: To transform or compute new data from existing fields.
• Common Use: To perform operations like mathematical calculations, string manipulations, etc.
Example: Eval Function UDF in Java
Let’s say we want to create a UDF that calculates the year of birth based on the
current year (assuming the current year is 2025).

1. Java UDF Code (Eval function to calculate the year of birth):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class BirthYearUDF extends EvalFunc<Integer> {

    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        int age = (Integer) input.get(0); // get the age from the tuple
        return 2025 - age;                // subtract age from the current year (2025)
    }
}

2. Using the Eval Function in Pig Script:

REGISTER 'BirthYearUDF.jar';
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
birth_year_data = FOREACH data GENERATE name, age, BirthYearUDF(age) AS birth_year;
DUMP birth_year_data;

3. Loader Functions UDF
Loader functions are used to load data from external sources into Pig. You can use
built-in loaders (e.g., PigStorage) or write your own custom loader if the data format is
unusual or proprietary.
• Purpose: To load data from files or databases into the Pig environment.
• Common Use: Used for reading data in formats that Pig can't load natively.

Example: Loader Function UDF in Java
Let's say we need to write a custom loader to load a proprietary data format where
fields are separated by a special delimiter (e.g., |).
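A minimal sketch of such a loader, assuming Pig's LoadFunc API (the class name PipeDelimitedLoader is hypothetical, and every field is loaded as a plain value with no declared schema):

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// hypothetical loader for a format whose fields are separated by '|'
public class PipeDelimitedLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat(); // read the input one line at a time
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue())
                return null; // end of input
            Text line = (Text) reader.getCurrentValue();
            // '|' is a regex metacharacter, so it must be escaped
            String[] fields = line.toString().split("\\|");
            ArrayList<Object> values = new ArrayList<Object>();
            for (String f : fields)
                values.add(f);
            return tupleFactory.newTuple(values);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

It would then be registered and used like any other loader:
data = LOAD 'records.dat' USING PipeDelimitedLoader();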

Introduction to Apache Hive:


Apache Hive is a data warehousing and SQL-like query language system built on top
of Apache Hadoop. It is primarily designed to handle large-scale data storage and
querying. Hive enables users to run queries on massive datasets stored in Hadoop's
HDFS (Hadoop Distributed File System) using a language called HiveQL, which is
similar to SQL (Structured Query Language).
Hive abstracts the complexity of MapReduce by providing a higher-level interface for
querying and managing large datasets. It is widely used for managing and analyzing
large-scale datasets, particularly in environments with big data.

SQL-Like Query Language (HiveQL):
• Hive provides an interface that is similar to SQL, called HiveQL (or HQL). Users can write SQL-like queries to retrieve, manipulate, and analyze data in a familiar way.
• It supports common SQL operations like SELECT, JOIN, GROUP BY, ORDER BY, INSERT, etc.
Data Warehouse Infrastructure:
• Hive is essentially a data warehouse built on top of Hadoop. It provides tools for querying, summarizing, and analyzing data stored in Hadoop's HDFS.
Scalability:
• Hive is designed to work with large datasets and can handle massive volumes of data, taking advantage of Hadoop's distributed storage and processing capabilities.
Extensibility:
• Hive supports user-defined functions (UDFs), which can be written in Java, to extend its functionality for custom queries and processing, as sketched below.
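As a minimal sketch, a Hive UDF written in Java against the classic org.apache.hadoop.hive.ql.exec.UDF API might look like this (the class and function names are illustrative):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// converts a string to upper case
public class Upper extends UDF {
    public Text evaluate(Text s) {
        if (s == null)
            return null;
        return new Text(s.toString().toUpperCase());
    }
}

After packaging the class into a jar, it would be registered and called from HiveQL:
ADD JAR upper-udf.jar;
CREATE TEMPORARY FUNCTION my_upper AS 'Upper';
SELECT my_upper(name) FROM employees;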

Hive Shell:
The Hive Shell is an interactive command-line interface (CLI) used to interact with
Apache Hive. It provides a convenient way to run HiveQL queries, manage
databases and tables, and perform other operations within the Hive environment. The
Hive Shell is typically used by data analysts, data engineers, and administrators for
querying, managing, and analyzing data stored in Hadoop or HDFS (Hadoop
Distributed File System).

The Hive Shell allows users to execute HiveQL commands directly on the Hadoop
cluster, making it an essential tool for day-to-day interactions with Hive.
Launching the Hive Shell: To start the Hive Shell, you need to have Apache Hive
installed and configured on your machine or cluster. Once installed, you can launch
the Hive Shell by simply typing the following command in the terminal:
Example: hive

Running Hive:
1. Hive Shell: The Hive Shell provides an interactive command-line interface to
run HiveQL queries directly. You can start it by typing hive in the terminal, and it
will allow you to execute queries, create databases, manage tables, and more.
2. Batch Mode (Hive Scripts): Hive supports executing multiple queries stored
in a script file. You can run a script using the -f option, which allows for batch
processing of HiveQL commands.
Example: hive -f my_script.hql
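For illustration, my_script.hql might contain a few HiveQL statements like those covered below:

-- my_script.hql (illustrative)
CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, salary FLOAT);
SELECT name, salary FROM employees WHERE salary > 4000;

Hive can also run a single query passed as a string with the -e option, e.g. hive -e "SELECT * FROM employees;".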

Apache Pig vs MapReduce vs Hive vs SQL: Key Differences


Definition:
• Apache Pig: A high-level platform for processing and analyzing large datasets in Hadoop using Pig Latin.
• MapReduce: A low-level, distributed programming model for processing large datasets in parallel.
• Hive: A data warehouse system built on Hadoop, using HiveQL (similar to SQL) for querying.
• SQL: A standard query language used for managing and manipulating relational databases.

Level of Abstraction:
• Apache Pig: High-level abstraction (using Pig Latin).
• MapReduce: Low-level programming model (Java-based).
• Hive: High-level abstraction (uses HiveQL, similar to SQL).
• SQL: High-level abstraction for querying relational databases.

Programming Model:
• Apache Pig: Data flow language (Pig Latin).
• MapReduce: Map and Reduce functions.
• Hive: Query language (HiveQL, SQL-like syntax).
• SQL: Structured Query Language (SQL) for relational databases.

Ease of Use:
• Apache Pig: Easier to write than MapReduce; more user-friendly with fewer lines of code.
• MapReduce: Requires extensive programming (usually in Java).
• Hive: Easier to use than MapReduce, with SQL-like syntax.
• SQL: Highly abstracted and easy to use for querying relational data.

Performance:
• Apache Pig: Less efficient than MapReduce in some cases, but faster for many operations due to optimization techniques.
• MapReduce: Provides full control over processing but requires manual optimization.
• Hive: Optimized for complex queries and large datasets, but not as low-level as MapReduce.
• SQL: Highly optimized for relational databases but not designed for big data processing.

Use Case:
• Apache Pig: ETL jobs, data processing, batch processing tasks in Hadoop.
• MapReduce: Complex, low-level tasks requiring fine-grained control over data processing.
• Hive: Data warehousing, analytical queries on large datasets.
• SQL: Data manipulation, transactional queries in relational databases.

Execution Engine:
• Apache Pig: Based on MapReduce but abstracts it for easier use.
• MapReduce: Directly executes Map and Reduce jobs on Hadoop clusters.
• Hive: Uses MapReduce or Tez for query execution in the background.
• SQL: The SQL engine in relational database management systems (RDBMS).

Data Model:
• Apache Pig: Works with flat files, text, and structured data stored in Hadoop.
• MapReduce: Works with large datasets processed through key-value pairs.
• Hive: Works with structured data (tables, partitions) in Hadoop.
• SQL: Works with structured data (tables, rows, columns) in relational databases.

Support for Joins:
• Apache Pig: Supports different types of joins (e.g., inner, outer) with ease.
• MapReduce: Joins are complex to implement manually.
• Hive: Easy to write joins with SQL-like syntax in HiveQL.
• SQL: Simple and powerful join operations.

Data Storage:
• Apache Pig: Typically interacts with HDFS (Hadoop Distributed File System).
• MapReduce: Data is processed across HDFS or other distributed file systems.
• Hive: Primarily uses HDFS for data storage.
• SQL: Uses relational databases (MySQL, PostgreSQL, etc.) for data storage.

HiveQL:
HiveQL (Hive Query Language) is the query language used in Apache Hive, a data
warehousing system built on top of Hadoop. It is a SQL-like language that enables
users to perform data manipulation, analysis, and querying of large datasets stored in
HDFS (Hadoop Distributed File System) and other storage systems integrated with
Hadoop. HiveQL allows users to interact with Hadoop through a familiar syntax,
making it easier for people with SQL backgrounds to work with big data.

Data Definition Operations (DDL):


1. CREATE DATABASE:
CREATE DATABASE my_database;

2. USE DATABASE:
USE my_database;

3. CREATE TABLE:
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
);

4. CREATE EXTERNAL TABLE:


CREATE EXTERNAL TABLE employees (
id INT,
name STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/path/to/external/data';
5. DROP DATABASE:
DROP DATABASE my_database CASCADE;
(CASCADE drops the database even if it still contains tables.)

6. DROP TABLE:
DROP TABLE employees;

7. ALTER TABLE:
ALTER TABLE employees ADD COLUMNS (age INT);

Data Manipulation Operations (DML):


1. INSERT INTO:
INSERT INTO employees VALUES (1, 'John Doe', 5000.0);

2. INSERT OVERWRITE:
INSERT OVERWRITE TABLE employees SELECT * FROM temp_employees;

3. LOAD DATA:
LOAD DATA INPATH '/path/to/data.csv' INTO TABLE employees;

4. SELECT:
SELECT * FROM employees WHERE salary > 4000;

Querying Operations:
1. SELECT:
SELECT name, salary FROM employees WHERE salary > 4000;

2. GROUP BY:
SELECT department, AVG(salary) FROM employees GROUP BY department;

3. HAVING:
SELECT department, COUNT(*) FROM employees GROUP BY department HAVING COUNT(*) > 5;

4. ORDER BY:
SELECT * FROM employees ORDER BY salary DESC;

5. LIMIT:
SELECT * FROM employees LIMIT 10;

6. JOIN:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d
ON e.department_id = d.department_id;

7. DISTINCT:
SELECT DISTINCT department FROM employees;
