BDA-V
Pig: Installing and Running Pig, an Example, Comparison with Databases, Pig Latin,
User Defined Functions, Data Processing Operators, Hive: The Hive Shell, An Example,
Running Hive, Comparison with Traditional Databases, HiveQL, Tables, Querying Data
Installing Pig:
To install Apache Pig:
1. Install Java: Ensure Java is installed (sudo apt install openjdk-8-jdk for Ubuntu).
2. Install Hadoop: Install Hadoop if not already installed.
3. Download Apache Pig: Get the latest release from Apache Pig's website.
4. Extract the Archive: Unzip the downloaded tar.gz file (tar -xvf pig-0.x.x.tar.gz).
5. Set Environment Variables: Add export PIG_HOME=/opt/pig and export
PATH=$PATH:$PIG_HOME/bin to ~/.bashrc or ~/.bash_profile.
6. Verify Installation: Run pig to check if it's working.
Running Pig:
Pig programs can be executed in three ways:
1. Script
Write your Pig commands in a file with a .pig extension.
Pig can run a script file that contains Pig commands. For example, pig script.pig runs
the commands in the local file script.pig. Alternatively, for very short scripts, you can
use the -e option to run a script specified as a string on the command line.
2. Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no file is
specified for Pig to run, and the -e option is not used. It is also possible to run Pig
scripts from within Grunt using run and exec.
Example: pig
3. Embedded
You can run Pig programs from Java using the PigServer class, much as you use
JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
Example:
PigServer pigServer = new PigServer(ExecType.LOCAL);
pigServer.registerQuery("your Pig command");
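The script modes above can be sketched as shell commands. The script name and its contents are made up for illustration, and the pig invocations are shown commented out because they need a Pig installation:

```shell
# Write a tiny Pig script (file name and contents are hypothetical)
printf "data = LOAD 'people.txt' USING PigStorage(',');\nDUMP data;\n" > script.pig
cat script.pig

# 1. Script mode (requires Pig installed):
# pig script.pig
# ...or a one-liner with the -e option:
# pig -e "DUMP data;"
# 2. Grunt shell (interactive, started when no file is given):
# pig
```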
Pig Latin:
Pig Latin is a high-level data flow language used in Apache Pig for processing and
analyzing large datasets in Hadoop. It simplifies the process of writing MapReduce
programs by providing a language that's easier to write and read.
Comments
Comments in Pig Latin can be written using:
o Single-line comments: Prefixed with --.
-- This is a single-line comment
o Multi-line comments: Enclosed in /* */.
/* This is a
multi-line comment */
1. Load: Reads data from the file system into a relation.
o Example:
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
2. Store: Writes a relation out to the file system.
o Example:
STORE data INTO 'output' USING PigStorage(',');
3. Dump: Prints the output of the data to the console (useful for small datasets).
o Example:
DUMP data;
Relational Operations
Pig provides various relational operations to manipulate data.
1. foreach: Applies a transformation to each tuple in the dataset.
o Example:
transformed_data = FOREACH data GENERATE name, age + 1;
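To see what this foreach produces, here is a stand-in using awk on a small comma-separated file; the people.txt contents are made up for illustration:

```shell
# Sample input matching the (name, age) schema used in this unit
printf 'alice,31\nbob,25\n' > people.txt
# Stand-in for: transformed_data = FOREACH data GENERATE name, age + 1;
awk -F, '{ print $1 "," $2 + 1 }' people.txt
# prints:
# alice,32
# bob,26
```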
User Defined Functions:
Pig Latin can be extended with functions written in Java. A filter function extends
FilterFunc (a subclass of EvalFunc that returns a Boolean for each input tuple).
1. Java UDF Code (Filter function to keep only people older than 30):
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
import java.io.IOException;

public class AgeFilterUDF extends FilterFunc {
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return false;
        return ((Integer) input.get(0)) > 30;
    }
}
Pig script using the UDF:
REGISTER 'AgeFilterUDF.jar';
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
filtered_data = FILTER data BY AgeFilterUDF(age);
DUMP filtered_data;
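The effect of the filter can be previewed without compiling the UDF by using awk as a stand-in on sample data (the file contents are hypothetical):

```shell
printf 'alice,31\nbob,25\ncarol,40\n' > people.txt
# Stand-in for: filtered_data = FILTER data BY AgeFilterUDF(age);  (keeps age > 30)
awk -F, '$2 > 30' people.txt
# prints:
# alice,31
# carol,40
```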
2. Java UDF Code (Eval function that computes a birth year as the current year
minus the age):
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import java.io.IOException;
import java.util.Calendar;

public class BirthYearUDF extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        return Calendar.getInstance().get(Calendar.YEAR) - (Integer) input.get(0);
    }
}
Pig script using the UDF:
REGISTER 'BirthYearUDF.jar';
data = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
birth_year_data = FOREACH data GENERATE name, age, BirthYearUDF(age) AS birth_year;
DUMP birth_year_data;
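Again, the UDF's arithmetic can be sketched with awk; the year is fixed at 2024 here so the output is deterministic, whereas the real UDF would use the current year:

```shell
printf 'alice,31\nbob,25\n' > people.txt
# Stand-in for: GENERATE name, age, BirthYearUDF(age) AS birth_year;
# YEAR is fixed so the example is reproducible
awk -F, -v YEAR=2024 '{ print $1 "," $2 "," YEAR - $2 }' people.txt
# prints:
# alice,31,1993
# bob,25,1999
```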
Hive Shell:
The Hive Shell is an interactive command-line interface (CLI) used to interact with
Apache Hive. It provides a convenient way to run HiveQL queries, manage
databases and tables, and perform other operations within the Hive environment. The
Hive Shell is typically used by data analysts, data engineers, and administrators for
querying, managing, and analyzing data stored in HDFS (the Hadoop Distributed
File System).
The Hive Shell allows users to execute HiveQL commands directly on the Hadoop
cluster, making it an essential tool for day-to-day interactions with Hive.
Launching the Hive Shell: To start the Hive Shell, you need to have Apache Hive
installed and configured on your machine or cluster. Once installed, you can launch
the Hive Shell by simply typing the following command in the terminal:
Example : hive
Running Hive:
1. Hive Shell: The Hive Shell provides an interactive command-line interface to
run HiveQL queries directly. You can start it by typing hive in the terminal, and it
will allow you to execute queries, create databases, manage tables, and more.
2. Batch Mode (Hive Scripts): Hive supports executing multiple queries stored
in a script file. You can run a script using the -f option, which allows for batch
processing of HiveQL commands.
Example: hive -f my_script.hql
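A minimal sketch of batch mode: write HiveQL statements to a file, then pass the file with -f. The table and column names are hypothetical, and the hive invocation is commented out because it needs a running Hive installation:

```shell
# Write a tiny HiveQL script
cat > my_script.hql <<'EOF'
SELECT name, salary FROM employees WHERE salary > 4000;
EOF
cat my_script.hql
# Batch mode (requires Hive installed):
# hive -f my_script.hql
```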
HiveQL:
HiveQL (Hive Query Language) is the query language used in Apache Hive, a data
warehousing system built on top of Hadoop. It is a SQL-like language that enables
users to perform data manipulation, analysis, and querying of large datasets stored in
HDFS (Hadoop Distributed File System) and other storage systems integrated with
Hadoop. HiveQL allows users to interact with Hadoop through a familiar syntax,
making it easier for people with SQL backgrounds to work with big data.
1. CREATE DATABASE:
CREATE DATABASE my_database;
2. USE DATABASE:
USE my_database;
3. CREATE TABLE:
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
);
6. DROP TABLE:
DROP TABLE employees;
7. ALTER TABLE:
ALTER TABLE employees ADD COLUMNS (age INT);
2. INSERT OVERWRITE:
INSERT OVERWRITE TABLE employees SELECT * FROM temp_employees;
3. LOAD DATA:
LOAD DATA INPATH '/path/to/data.csv' INTO TABLE employees;
4. SELECT:
SELECT * FROM employees WHERE salary > 4000;
Querying Operations:
1. SELECT:
SELECT name, salary FROM employees WHERE salary > 4000;
2. GROUP BY:
SELECT department, AVG(salary) FROM employees GROUP BY department;
3. HAVING:
SELECT department, COUNT(*) FROM employees
GROUP BY department HAVING COUNT(*) > 5;
4. ORDER BY:
SELECT * FROM employees ORDER BY salary DESC;
5. LIMIT:
SELECT * FROM employees LIMIT 10;
6. JOIN:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d
ON e.department_id = d.department_id;
7. DISTINCT:
SELECT DISTINCT department FROM employees;
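Since Hive is not always available for quick experiments, the GROUP BY and JOIN semantics above can be imitated with standard shell tools on small CSV files; every file name and value below is made up for illustration:

```shell
# (name, department, salary) rows for the GROUP BY example
printf 'a,eng,5000\nb,eng,7000\nc,hr,4000\n' > employees.csv
# Stand-in for: SELECT department, AVG(salary) FROM employees GROUP BY department;
# (piped through sort for a stable output order)
awk -F, '{ sum[$2] += $3; cnt[$2]++ }
         END { for (d in cnt) print d "," sum[d] / cnt[d] }' employees.csv | sort
# prints:
# eng,6000
# hr,4000

# (name, department_id) and (department_id, department_name) for the JOIN example
printf 'alice,10\nbob,20\n' > emp.csv
printf '10,Engineering\n20,HR\n' > dept.csv
# Stand-in for the HiveQL JOIN ... ON e.department_id = d.department_id;
# join(1) needs both inputs sorted on the join key
sort -t, -k2,2 emp.csv > emp.sorted
sort -t, -k1,1 dept.csv > dept.sorted
join -t, -1 2 -2 1 -o 1.1,2.2 emp.sorted dept.sorted
# prints:
# alice,Engineering
# bob,HR
```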