
UNIT V HADOOP RELATED TOOLS

HBase – data model and implementations – HBase clients – HBase examples – praxis. Pig –
Grunt – Pig data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data types
and file formats – HiveQL data definition – HiveQL data manipulation – HiveQL queries.

HBase

Hadoop can perform only batch processing, and data is accessed only in a sequential
manner. That means one has to scan the entire dataset even for the simplest of jobs.

A huge dataset, when processed, results in another huge dataset, which should also be
processed sequentially. At this point, a new solution is needed to access any point of data in a
single unit of time (random access).

What is HBase?

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.

HBase is a data model, similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.
HBase and HDFS

• HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for large tables.
• HDFS provides high-latency batch processing, whereas HBase provides low-latency access to single rows from billions of records (random access).
• HDFS provides only sequential access to data, whereas HBase internally uses hash tables, provides random access, and stores its data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase

HBase is a column-oriented database, and the tables in it are sorted by row key. The table
schema defines only column families, which are key-value pairs. A table can have multiple
column families, and each column family can have any number of columns. Subsequent column
values are stored contiguously on disk. Each cell value in the table has a timestamp. In
short, in an HBase table:

• Table is a collection of rows.


• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.

Given below is an example schema of a table in HBase. Each row, identified by a Rowid (1, 2, 3, ...), spans four column families, and each column family contains its own columns (col1, col2, col3):

Rowid | Column Family 1    | Column Family 2    | Column Family 3    | Column Family 4
      | col1  col2  col3   | col1  col2  col3   | col1  col2  col3   | col1  col2  col3
1     |
2     |
3     |
Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data,
rather than as rows of data. In short, they have column families.

Row-Oriented Database:
• It is suitable for Online Transaction Processing (OLTP).
• Such databases are designed for a small number of rows and columns.

Column-Oriented Database:
• It is suitable for Online Analytical Processing (OLAP).
• Such databases are designed for huge tables.

HBase and RDBMS

HBase:
• HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column families.
• It is built for wide tables and is horizontally scalable.
• There are no transactions in HBase.
• It has de-normalized data.
• It is good for semi-structured as well as structured data.

RDBMS:
• An RDBMS is governed by its schema, which describes the whole structure of its tables.
• It is thin and built for small tables, and it is hard to scale.
• An RDBMS is transactional.
• It will have normalized data.
• It is good for structured data.

Features of HBase

• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.

Where to Use HBase

• Apache HBase is used to provide random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable
  works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

• It is used for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
5.2 Data model and implementations

The Apache HBase data model is a distributed, multidimensional, persistent, sorted map that is
indexed by row key, column key, and timestamp, which is why Apache HBase is also called a
key-value storage system.

The following data model terminology is used in Apache HBase.

1. Table

Apache HBase organizes data into tables. Table names are strings made up of characters that
are safe to use in a file system path.

2. Row

Apache HBase stores its data based on rows, and each row has a unique row key. The row
key is represented as a byte array.

3. Column Family

Column families group the columns of a row together and provide the structure for storing data
in Apache HBase. Column family names are composed of characters and strings that can be used
in a file system path. Every row in a table has the same set of column families, but a row does
not need to store a value in every column family.

4. Column Qualifier

A column qualifier is used to point to the data that is stored in a column family. It is always
represented as a byte array.

5. Cell

A cell is identified by the combination of row key, column family, and column qualifier, and the
data stored in it is called the cell's value.

6. Timestamp

The values stored in a cell are versioned, and each version is identified by a timestamp that is
assigned at write time. If we don't mention a timestamp while writing data, the current time is
used.

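To make these terms concrete, here is a minimal, hedged Java sketch (the table name 'employee', column family 'personal', and qualifier 'name' are illustrative assumptions, and 'table' stands for an already-obtained org.apache.hadoop.hbase.client.Table instance; the full client API is covered in the following sections). It shows how a cell is addressed by row key, column family, column qualifier, and timestamp:

// import org.apache.hadoop.hbase.client.*;
// import org.apache.hadoop.hbase.util.Bytes;

byte[] rowKey    = Bytes.toBytes("emp001");      // row key
byte[] family    = Bytes.toBytes("personal");    // column family
byte[] qualifier = Bytes.toBytes("name");        // column qualifier
long   ts        = System.currentTimeMillis();   // timestamp (version)

Put put = new Put(rowKey);
put.addColumn(family, qualifier, ts, Bytes.toBytes("Raju"));   // cell value
table.put(put);

Get get = new Get(rowKey);
get.addColumn(family, qualifier);
Result result = table.get(get);
byte[] value = result.getValue(family, qualifier);             // latest version of the cell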


5.2.1 HBase Architecture

In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into "Stores". Stores are saved as files in HDFS.

Note: The term 'store' is used for regions to explain the storage structure.
Components of Apache HBase Architecture

HBase architecture has three important components: HMaster, Region Server, and ZooKeeper.

i. HMaster

• Manages and monitors the HBase cluster.
• Performs administration (provides an interface for creating, updating, and deleting tables).
• Controls failover.
• DDL operations are handled by the HMaster.
• Whenever a client wants to change the schema or any of the metadata, the HMaster is
  responsible for these operations.

ii. Region Server

Regions are nothing but tables that are split up and spread across the region servers.

The region servers have regions that:

• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of the region by following the region size thresholds.

When we take a deeper look into a region server, it contains regions and stores, as described
below.

The store contains the MemStore and HFiles. The MemStore is just like a cache memory:
anything that is written to HBase is stored here initially. Later, the data is transferred and
saved in HFiles as blocks, and the MemStore is flushed.
iii. ZooKeeper

ZooKeeper is like a central coordinator. It helps HBase do a few important things:

1. Keeping Track of Cluster Info: It stores important information about the HBase
cluster, like where data is stored and which servers are working.
2. Choosing a Leader: When there are multiple HBase servers, ZooKeeper helps pick
one as the leader to manage everything smoothly.
3. Spotting Problems: ZooKeeper watches over the servers. If one stops working,
ZooKeeper alerts HBase so it can fix things quickly.
4. Managing Settings: It helps HBase adjust its settings based on what's happening in
the cluster.
5. Helping Clients Find Data: ZooKeeper tells clients where to look for data in the
HBase cluster.

5.3 HBase clients

This section describes the Java client API for HBase that is used to perform CRUD operations on
HBase tables. HBase is written in Java and has a native Java API; it therefore provides
programmatic access to Data Manipulation Language (DML) operations.

Class HBaseConfiguration

Adds HBase configuration files to a Configuration. This class belongs to the
org.apache.hadoop.hbase package.

Methods and description

1. static org.apache.hadoop.conf.Configuration create()
   This method creates a Configuration with HBase resources.

Class HTable

HTable is an HBase internal class that represents an HBase table. It is an implementation of
Table that is used to communicate with a single HBase table. This class belongs to the
org.apache.hadoop.hbase.client package.

Constructors

1. HTable()
2. HTable(TableName tableName, ClusterConnection connection, ExecutorService pool)
   Using this constructor, you can create an object to access an HBase table.

Methods and description

1. void close()
   Releases all the resources of the HTable.
2. void delete(Delete delete)
   Deletes the specified cells/row.
3. boolean exists(Get get)
   Using this method, you can test the existence of columns in the table, as specified by Get.
4. Result get(Get get)
   Retrieves certain cells from a given row.
5. org.apache.hadoop.conf.Configuration getConfiguration()
   Returns the Configuration object used by this instance.
6. TableName getName()
   Returns the table name instance of this table.
7. HTableDescriptor getTableDescriptor()
   Returns the table descriptor for this table.
8. byte[] getTableName()
   Returns the name of this table.
9. void put(Put put)
   Using this method, you can insert data into the table.

Class Put

This class is used to perform Put operations for a single row. It belongs to the
org.apache.hadoop.hbase.client package.

Constructors

1. Put(byte[] row)
   Using this constructor, you can create a Put operation for the specified row.
2. Put(byte[] rowArray, int rowOffset, int rowLength)
   Using this constructor, you can make a copy of the passed-in row key to keep local.
3. Put(byte[] rowArray, int rowOffset, int rowLength, long ts)
   Using this constructor, you can make a copy of the passed-in row key to keep local, using the
   given timestamp.
4. Put(byte[] row, long ts)
   Using this constructor, you can create a Put operation for the specified row, using a given
   timestamp.

Methods

1. Put add(byte[] family, byte[] qualifier, byte[] value)
   Adds the specified column and value to this Put operation.
2. Put add(byte[] family, byte[] qualifier, long ts, byte[] value)
   Adds the specified column and value, with the specified timestamp as its version, to this Put
   operation.
3. Put add(byte[] family, ByteBuffer qualifier, long ts, ByteBuffer value)
   Adds the specified column and value, with the specified timestamp as its version, to this Put
   operation.

Class Get

This class is used to perform Get operations on a single row. This class belongs to the
org.apache.hadoop.hbase.client package.

Constructors

1. Get(byte[] row)
   Using this constructor, you can create a Get operation for the specified row.
2. Get(Get get)
   Creates a Get operation by copying another Get (copy constructor).

Methods

1. Get addColumn(byte[] family, byte[] qualifier)
   Retrieves the column from the specific family with the specified qualifier.
2. Get addFamily(byte[] family)
   Retrieves all columns from the specified family.

Class Delete

This class is used to perform Delete operations on a single row. To delete an entire row,
instantiate a Delete object with the row to delete. This class belongs to the
org.apache.hadoop.hbase.client package.

Constructors

1. Delete(byte[] row)
   Creates a Delete operation for the specified row.
2. Delete(byte[] rowArray, int rowOffset, int rowLength)
   Creates a Delete operation for the specified row.
3. Delete(byte[] rowArray, int rowOffset, int rowLength, long ts)
   Creates a Delete operation for the specified row and timestamp.
4. Delete(byte[] row, long timestamp)
   Creates a Delete operation for the specified row and timestamp.

Methods

1. Delete addColumn(byte[] family, byte[] qualifier)
   Deletes the latest version of the specified column.
2. Delete addColumns(byte[] family, byte[] qualifier, long timestamp)
   Deletes all versions of the specified column with a timestamp less than or equal to the
   specified timestamp.
3. Delete addFamily(byte[] family)
   Deletes all versions of all columns of the specified family.
4. Delete addFamily(byte[] family, long timestamp)
   Deletes all columns of the specified family with a timestamp less than or equal to the
   specified timestamp.
Class Result

This class is used to get a single row result of a Get or a Scan query.

Constructors

1. Result()
   Using this constructor, you can create an empty Result with no KeyValue payload; rawCells()
   returns null if you call it.

Methods

1. byte[] getValue(byte[] family, byte[] qualifier)
   This method is used to get the latest version of the specified column.
2. byte[] getRow()
   This method is used to retrieve the row key that corresponds to the row from which this
   Result was created.

5.4 HBase examples

Here are a few examples of how you can use the HBase Java API:

1. Creating a Table:

TableName tableName = TableName.valueOf("myTable");


HTableDescriptor tableDescriptor = new HTableDescriptor(tableName);
tableDescriptor.addFamily(new HColumnDescriptor("cf1"));
tableDescriptor.addFamily(new HColumnDescriptor("cf2"));
admin.createTable(tableDescriptor);

2. Inserting Data:

TableName tableName = TableName.valueOf("myTable");


Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("col2"), Bytes.toBytes("value2"));
Table table = connection.getTable(tableName);
table.put(put);

3. Getting Data:

TableName tableName = TableName.valueOf("myTable");


Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));
Table table = connection.getTable(tableName);
Result result = table.get(get);
byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));

4. Scanning Data:

TableName tableName = TableName.valueOf("myTable");


Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));
Table table = connection.getTable(tableName);
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));
// Process the retrieved value
}
scanner.close();

5. Deleting Data:

TableName tableName = TableName.valueOf("myTable");


Delete delete = new Delete(Bytes.toBytes("row1"));
delete.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));
Table table = connection.getTable(tableName);
table.delete(delete);

These examples demonstrate basic operations in HBase, including creating a table, inserting
data, retrieving data, scanning data, and deleting data. Remember to properly handle
exceptions and ensure you have established a connection to the HBase cluster using the
appropriate configuration settings.
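The snippets above assume that a Connection object and an Admin instance already exist. As a minimal sketch of how they might be obtained (the ZooKeeper quorum address is an assumption that depends on your cluster, and this is not the only way to set things up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseClientSetup {
    public static void main(String[] args) throws IOException {
        // Loads hbase-site.xml / hbase-default.xml resources into a Configuration.
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum; adjust to match your cluster.
        conf.set("hbase.zookeeper.quorum", "localhost");

        // A Connection is heavyweight and thread-safe; create it once and reuse it.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // 'connection' and 'admin' can now be used as in the examples above,
            // e.g. connection.getTable(...) or admin.createTable(...).
            System.out.println("Number of tables: " + admin.listTableNames().length);
        }
    }
}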

HBase has a wide range of applications in areas such as the following:

1. Medical
In the medical field, HBase is used for storing genome sequences and running MapReduce on
them, and for storing the disease history of patients.

2. Sports
In the sports field, HBase is used for storing match histories for better analytics and prediction.

3. Web
It is also used to store user history and preferences for better customer targeting.

4. Oil and petroleum industry
HBase is used in the oil and petroleum industry to store exploration data for analysis and to
predict probable places where oil can be found.

5. E-commerce
It is used for recording and storing logs about customer search history, and to perform
analytics and then target advertisements for better business.

Some of the companies that use HBase:

• Facebook
• Pinterest
• Aadhaar (UIDAI)
• Twitter
• Yahoo

5.5 Pig

Pig is a scripting platform that runs on Hadoop clusters designed to process and
analyze large datasets. Pig is extensible, self-optimizing, and easily programmed.

Programmers can use Pig to write data transformations without knowing Java. Pig uses both
structured and unstructured data as input to perform analytics and uses HDFS to store the
results.

Components of Pig

There are two major components of the Pig:

• Pig Latin script language


• A runtime engine

Pig Latin script language

The Pig Latin script is a procedural data flow language. It contains syntax and commands that
can be applied to implement business logic. Examples of Pig Latin are LOAD and STORE.

A runtime engine
The runtime engine is a compiler that produces sequences of MapReduce programs. It uses
HDFS to store and retrieve data. It is also used to interact with the Hadoop system (HDFS
and MapReduce).

Data Model in Pig

As part of its data model, Pig supports four basic types.

1. Atom: It is a simple atomic value like int, long, double, or string.


2. Tuple: It is a sequence of fields that can be of any data type.
3. Bag: It is a collection of tuples of potentially varying structures and can contain
duplicates.
4. Map: It is an associative array.
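As a brief illustration of these four types, the following hedged sketch (the file students.txt and its layout are assumptions made only for this example) declares a LOAD schema in which each type appears:

-- Atom, tuple, bag, and map in a single LOAD schema
students = LOAD 'students.txt' USING PigStorage('|') AS (
    name:chararray,                                    -- atom
    address:tuple(street:chararray, city:chararray),   -- tuple
    courses:bag{c:tuple(cid:int, title:chararray)},    -- bag of tuples
    scores:map[int]                                    -- map with int values
);
DESCRIBE students;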

Pig Commands

Given below in the table are some frequently used Pig commands.

Command          Function
LOAD             Reads data from the file system
STORE            Writes data to the file system
FOREACH          Applies expressions to each record and outputs one or more records
FILTER           Applies a predicate and removes records that do not return true
GROUP/COGROUP    Collects records with the same key from one or more inputs
JOIN             Joins two or more inputs based on a key
ORDER            Sorts records based on a key
DISTINCT         Removes duplicate records
UNION            Merges data sets
SPLIT            Splits data into two or more sets based on filter conditions
STREAM           Sends all records through a user-provided binary
DUMP             Writes output to stdout
LIMIT            Limits the number of records

Applications of Apache Pig

Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping. Apache Pig is used −

• To process huge data sources such as web logs.


• To perform data processing for search platforms.
• To process time sensitive data loads.

Apache Pig – History

In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and
execute MapReduce jobs on every dataset. In 2007, Apache Pig was open sourced via
Apache incubator. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig
graduated as an Apache top-level project.

Apache Pig Architecture

The main reason why programmers have started using Hadoop Pig is that it converts their
scripts into a series of MapReduce tasks, making their job easy. The Pig Hadoop framework
has four main components:

1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the
parser. The parser is responsible for checking the syntax of the script, along with other
miscellaneous checks. Parser gives an output in the form of a Directed Acyclic Graph
(DAG) that contains Pig Latin statements, together with other logical operators
represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for DAG is
passed to a logical optimizer. The optimizer is responsible for carrying out the logical
optimizations.
3. Compiler: The role of the compiler comes in when the output from the optimizer is
received. The compiler compiles the logical plan sent by the optimizer. The logical
plan is then converted into a series of MapReduce tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs
are sent to Hadoop in a properly sorted order, and these jobs are executed on Hadoop
for yielding the desired result.

5.6 Grunt

Apache Pig Grunt is an interactive shell that enables users to enter Pig Latin interactively and
also provides a way to run HDFS and local file system commands. You can enter Pig Latin
commands directly into the Grunt shell for execution. Apache Pig starts executing the Pig Latin
statements when it receives a STORE or DUMP command. Before executing a command, the
Grunt shell checks its syntax and semantics to avoid errors.

To start the Grunt shell in local mode, type:

$ pig -x local

It will start the Grunt shell:

grunt>

HDFS commands in Pig Grunt

We can use the Pig Grunt shell to run HDFS commands as well. Starting from Pig version 0.5,
all Hadoop fs shell commands are available. They are accessed using the keyword fs followed
by the sub-command.

Let us see a few HDFS commands from the Pig Grunt shell.

fs -ls

This command prints all directories present in the HDFS root ("/").

Syntax:
grunt> fs subcommand subcommand_parameters;

Command:
grunt> fs -ls /

fs -cat

This command prints the content of a file present in HDFS.

Syntax:
grunt> fs subcommand subcommand_parameters;

fs -mkdir

This command creates a directory in HDFS.

Syntax:
grunt> fs subcommand subcommand_parameters;

Shell commands in Pig Grunt

We can use the Pig Grunt shell to run basic shell commands as well. Any shell command can be
invoked using sh. Note that we cannot execute commands that are part of the shell environment,
such as cd.

Let us see a few shell commands from the Pig Grunt shell.

sh ls

This command lists all directories/files in the current working directory.

Syntax:
grunt> sh subcommand subcommand_parameters

Command:
grunt> sh ls

sh cat

This command prints the content of a local file.

Syntax:
grunt> sh subcommand subcommand_parameters

Utility commands in Pig Grunt

The Pig Grunt shell supports utility commands such as help, clear, and history. Apart from these,
Grunt also provides commands for controlling Pig and MapReduce, such as exec, run, and kill.
Help Command

Help command provides a list of Pig commands.

Syntax:
grunt> help

Clear Command

The clear command is used to clear the screen of the Grunt shell.

Syntax:
grunt> clear

History Command

The history command displays the list of statements executed so far in the Grunt shell.

Syntax:
grunt> history

Set Command

The SET command is used to assign values to keys, which are case sensitive. If the SET
command is used without arguments, then all system properties and configurations are
printed.

Syntax:
grunt> set [key 'value']

Kill Command

The kill command will attempt to kill any MapReduce jobs associated with the given Pig job.

Syntax:
grunt> kill JobId
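As a short, hedged illustration of the script-control commands mentioned above (myscript.pig and the job id are hypothetical names), a Grunt session might look like this:

grunt> exec myscript.pig
(runs the script in a separate batch context; aliases defined inside it are not visible in the shell afterwards)

grunt> run myscript.pig
(runs the script in the current Grunt context, so its aliases remain available interactively)

grunt> kill job_201310191200_0001
(attempts to kill the MapReduce job with the given job id)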
5.7 Pig Latin

Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter,
we are going to discuss the basics of Pig Latin, such as Pig Latin statements, data types,
general and relational operators, and Pig Latin UDFs.

Pig Latin – Data Model

The data model of Pig is fully nested. A relation is the outermost structure of the Pig Latin
data model, and it is a bag, where:

• A bag is a collection of tuples.


• A tuple is an ordered set of fields.
• A field is a piece of data.

Pig Latin – Statements

While processing data using Pig Latin, statements are the basic constructs.

• These statements work with relations. They include expressions and schemas.
• Every statement ends with a semicolon (;).
• We will perform various operations using operators provided by Pig Latin, through
statements.
• Except LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
• As soon as you enter a LOAD statement in the Grunt shell, its semantic checking will be
carried out. To see the contents of the loaded relation, you need to use the DUMP operator.
Only after performing the dump operation will the MapReduce job for loading the data be
carried out.

Example

Given below is a Pig Latin statement, which loads data into Apache Pig.

grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
       AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
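As a brief, hedged continuation of this example (the actual output depends on the contents of student_data.txt, which are not shown here), the loaded relation can then be inspected with:

grunt> DESCRIBE Student_data;
(prints the schema of the relation)

grunt> DUMP Student_data;
(triggers a MapReduce job and prints the tuples of the relation to the console)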

Pig Latin – Data types

The table given below describes the Pig Latin data types.

1. int: Represents a signed 32-bit integer. Example: 8
2. long: Represents a signed 64-bit integer. Example: 5L
3. float: Represents a signed 32-bit floating point. Example: 5.5F
4. double: Represents a 64-bit floating point. Example: 10.5
5. chararray: Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6. bytearray: Represents a byte array (blob).
7. boolean: Represents a Boolean value. Example: true/false
8. datetime: Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger: Represents a Java BigInteger. Example: 60708090709
10. bigdecimal: Represents a Java BigDecimal. Example: 185.98376256272893883

Complex Types

11. tuple: A tuple is an ordered set of fields. Example: (raja, 30)
12. bag: A bag is a collection of tuples. Example: {(raju,30),(Mohhammad,45)}
13. map: A map is a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]

Pig Latin – Arithmetic Operators

The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.

+     Addition: adds the values on either side of the operator. Example: a + b gives 30
−     Subtraction: subtracts the right-hand operand from the left-hand operand. Example: a − b gives −10
*     Multiplication: multiplies the values on either side of the operator. Example: a * b gives 200
/     Division: divides the left-hand operand by the right-hand operand. Example: b / a gives 2
%     Modulus: divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a gives 0
?:    Bincond: evaluates a Boolean expression. It has three operands, as in
      variable x = (expression) ? value1 if true : value2 if false.
      Example: b = (a == 1) ? 20 : 30; if a == 1 the value of b is 20, otherwise the value of b is 30
CASE WHEN ... THEN ... ELSE ... END
      Case: the case operator is equivalent to a nested bincond operator.
      Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
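As a short, hedged usage sketch (it reuses the Student_data relation loaded earlier, so it assumes that example has already been run):

-- derive a new field using the modulus and bincond operators
parity = FOREACH Student_data GENERATE id, (id % 2 == 0 ? 'even' : 'odd') AS id_parity;
DUMP parity;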

Pig Latin – Comparison Operators

The following table describes the comparison operators of Pig Latin.

==       Equal: checks whether the values of two operands are equal; if yes, then the condition
         becomes true. Example: (a == b) is not true
!=       Not Equal: checks whether the values of two operands are not equal; if the values are not
         equal, then the condition becomes true. Example: (a != b) is true
>        Greater than: checks whether the value of the left operand is greater than the value of the
         right operand. If yes, then the condition becomes true. Example: (a > b) is not true
<        Less than: checks whether the value of the left operand is less than the value of the right
         operand. If yes, then the condition becomes true. Example: (a < b) is true
>=       Greater than or equal to: checks whether the value of the left operand is greater than or
         equal to the value of the right operand. If yes, then the condition becomes true.
         Example: (a >= b) is not true
<=       Less than or equal to: checks whether the value of the left operand is less than or equal to
         the value of the right operand. If yes, then the condition becomes true.
         Example: (a <= b) is true
matches  Pattern matching: checks whether the string on the left-hand side matches the
         regular-expression constant on the right-hand side. Example: f1 matches '.*tutorial.*'

Pig Latin – Type Construction Operators

The following table describes the type construction operators of Pig Latin.

()    Tuple constructor operator: used to construct a tuple. Example: (Raju, 30)
{}    Bag constructor operator: used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[]    Map constructor operator: used to construct a map. Example: [name#Raja, age#30]

Pig Latin – Relational Operations

The following table describes the relational operators of Pig Latin.

Loading and Storing
LOAD                  To load the data from the file system (local/HDFS) into a relation.
STORE                 To save a relation to the file system (local/HDFS).

Filtering
FILTER                To remove unwanted rows from a relation.
DISTINCT              To remove duplicate rows from a relation.
FOREACH, GENERATE     To generate data transformations based on columns of data.
STREAM                To transform a relation using an external program.

Grouping and Joining
JOIN                  To join two or more relations.
COGROUP               To group the data in two or more relations.
GROUP                 To group the data in a single relation.
CROSS                 To create the cross product of two or more relations.

Sorting
ORDER                 To arrange a relation in sorted order based on one or more fields
                      (ascending or descending).
LIMIT                 To get a limited number of tuples from a relation.

Combining and Splitting
UNION                 To combine two or more relations into a single relation.
SPLIT                 To split a single relation into two or more relations.

Diagnostic Operators
DUMP                  To print the contents of a relation on the console.
DESCRIBE              To describe the schema of a relation.
EXPLAIN               To view the logical, physical, or MapReduce execution plans used to
                      compute a relation.
ILLUSTRATE            To view the step-by-step execution of a series of statements.
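Putting several of these operators together, a hedged end-to-end sketch might look like the following. It reuses the student_data.txt file from the earlier LOAD example; the output path /pig_output is an assumption made for illustration:

-- load the input (schema as in the earlier example)
Student_data = LOAD 'student_data.txt' USING PigStorage(',')
               AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

-- drop records without a valid id
valid = FILTER Student_data BY id IS NOT NULL;

-- count the students in each city
by_city = GROUP valid BY city;
city_counts = FOREACH by_city GENERATE group AS city, COUNT(valid) AS total;

-- sort by the count, inspect the result, and store it back to the file system
sorted = ORDER city_counts BY total DESC;
DUMP sorted;
STORE sorted INTO '/pig_output' USING PigStorage(',');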

5.8 Developing and testing Pig Latin scripts

Developing and testing Pig Latin scripts typically involves several steps. Pig Latin is a
scripting language used with Apache Pig, a platform for analyzing large data sets. Here's a
general guide on how to develop and test Pig Latin scripts:

1. Set up your environment: First, ensure that you have Apache Pig installed and
configured properly on your system. You can download Pig from the Apache Pig
website and follow the installation instructions.
2. Write your Pig Latin script: Create a Pig Latin script using a text editor or an
integrated development environment (IDE). Pig Latin scripts are typically written in
.pig files. These scripts consist of a series of statements that define the data flow and
transformations you want to apply to your data.
3. Understand the data: Before writing your script, understand the structure and format
of your data. Pig works well with structured and semi-structured data, such as CSV
files, JSON, or log files. Ensure that you have sample data available for testing your
script.
4. Test your script locally: Once you have written your Pig Latin script, you can test it
locally on a small sample of data to ensure that it behaves as expected. You can run Pig
in local mode by passing the -x local flag to the pig command, followed by the name of
your script file. For example:

pig -x local myscript.pig

This will execute your script locally, without requiring a Hadoop cluster.

5. Debugging: If your script encounters errors during execution, use Pig's built-in
debugging features to identify and fix the issues. You can use the grunt shell to
interactively debug your script and inspect the intermediate results of each step.
6. Scale up testing: Once your script works correctly on a small sample of data, scale
up your testing by running it on larger datasets. You can do this by executing your
script on a Hadoop cluster using Pig's distributed mode. This allows you to test the
scalability and performance of your script under real-world conditions.
7. Optimize performance: If your script is slow or inefficient, optimize it for better
performance. This may involve restructuring your script, using built-in Pig
optimization techniques, or leveraging user-defined functions (UDFs) for custom
processing tasks.
8. Automate testing: Consider automating the testing of your Pig Latin scripts using
tools like Apache Oozie or Apache Airflow. This allows you to schedule and run your
scripts automatically, monitor their execution, and capture any errors or failures.
9. Document your script: Finally, document your Pig Latin script to make it easier for
others to understand and use. Include comments, annotations, and documentation that
explain the purpose of each step and how to run the script.
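As a hedged sketch of steps 4 and 5 above (the script name myscript.pig and the alias city_counts are hypothetical), a typical local test-and-debug session might look like this:

$ pig -x local myscript.pig
(runs the whole script against local sample data)

$ pig -x local
grunt> run myscript.pig
grunt> DESCRIBE city_counts;
(checks the schema of an intermediate relation)
grunt> ILLUSTRATE city_counts;
(shows sample rows flowing through each step of the script)
grunt> EXPLAIN city_counts;
(prints the logical, physical, and MapReduce execution plans)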

Hive

Hive is an open-source data warehouse infrastructure built on top of Apache Hadoop. It
provides a high-level query language, called HiveQL (similar to SQL), that allows users to
perform SQL-like queries on large datasets stored in Hadoop's distributed file system (HDFS)
or other compatible file systems. Hive translates HiveQL queries into MapReduce, Tez, or
Spark jobs, allowing users to leverage the power of Hadoop for data processing and analysis.
Architecture of Hive

Hive chiefly consists of three core parts:

• Hive Clients: Hive offers a variety of drivers designed for communication with
different applications. For example, Hive provides Thrift clients for Thrift-based
applications. These clients and drivers then communicate with the Hive server, which
falls under Hive services.
• Hive Services: Hive services perform client interactions with Hive. For example, if a
client wants to perform a query, it must talk with Hive services.
• Hive Storage and Computing: Hive services such as the file system, job client, and
metastore then communicate with Hive storage and store things like table metadata
and query results.

Hive's Features

These are Hive's chief characteristics:

• Hive is designed for querying and managing only structured data stored in tables
• Hive is scalable, fast, and uses familiar concepts
• Schema gets stored in a database, while processed data goes into a Hadoop Distributed
File System (HDFS)
• Tables and databases get created first; then data gets loaded into the proper tables
• Hive supports four file formats: ORC, SEQUENCEFILE, RCFILE (Record Columnar
File), and TEXTFILE
• Hive uses an SQL-inspired language, sparing the user from dealing with the complexity
of MapReduce programming. It makes learning more accessible by utilizing familiar
concepts found in relational databases, such as columns, tables, rows, and schema, etc.
• The most significant difference between the Hive Query Language (HQL) and SQL is
that Hive executes queries on Hadoop's infrastructure instead of on a traditional
database
• Since Hadoop's programming works on flat files, Hive uses directory structures to
"partition" data, improving performance on specific queries
• Hive supports partition and buckets for fast and simple data retrieval
• Hive supports custom user-defined functions (UDF) for tasks like data cleansing and
filtering. Hive UDFs can be defined according to programmers' requirements

Limitations of Hive

Of course, no resource is perfect, and Hive has some limitations. They are:

• Hive doesn’t support OLTP. Hive supports Online Analytical Processing (OLAP), but
not Online Transaction Processing (OLTP).
• It doesn’t support subqueries.
• It has a high latency.
• By default, Hive tables don’t support delete or update operations (only ACID-enabled
transactional tables do).

Hive Modes

Depending on the size of Hadoop data nodes, Hive can operate in two different modes:

• Local mode
• Map-reduce mode

Use Local mode when:

• Hadoop is installed under the pseudo mode, possessing only one data node
• The data size is smaller and limited to a single local machine
• Users expect faster processing because the local machine contains smaller datasets.

Use Map Reduce mode when:

• Hadoop has multiple data nodes, and the data is distributed across these different
nodes
• Users must deal with more massive data sets

MapReduce is Hive's default mode.

Data types and file formats

This chapter takes you through the different data types in Hive, which are involved in the
table creation. All the data types in Hive are classified into four types, given as follows:

• Column Types
• Literals
• Null Values
• Complex Types
Column Types

Column type are used as column data types of Hive. They are as follows:

Integral Types

Integer type data can be specified using integral data types, INT. When the data range
exceeds the range of INT, you need to use BIGINT and if the data range is smaller than the
INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

The following table depicts various INT data types:

Type Postfix Example

TINYINT Y 10Y

SMALLINT S 10S

INT - 10

BIGINT L 10L

String Types

String type data types can be specified using single quotes (' ') or double quotes (" "). It
contains two data types: VARCHAR and CHAR. Hive follows C-types escape characters.

The following table depicts various CHAR data types:

Data Type    Length

VARCHAR      1 to 65535

CHAR         255

Timestamp

It supports traditional UNIX timestamp with optional nanosecond precision. It supports


java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.

Dates

DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.

Decimals

The DECIMAL type in Hive is as same as Big Decimal format of Java. It is used for
representing immutable arbitrary precision. The syntax and example is as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types

Union is a collection of heterogeneous data types. You can create an instance using create
union. The syntax and example is as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals

The following literals are used in Hive:

Floating Point Types

Floating point types are nothing but numbers with decimal points. Generally, this type of data
is composed of DOUBLE data type.

Decimal Type

Decimal type data is nothing but a floating point value with a higher range than the DOUBLE
data type. The range of the decimal type is approximately -10^-308 to 10^308.

Null Value

Missing values are represented by the special value NULL.

Complex Types

The Hive complex data types are as follows:

Arrays

Arrays in Hive are used the same way they are used in Java.

Syntax: ARRAY<data_type>
Maps

Maps in Hive are similar to Java Maps.

Syntax: MAP<primitive_type, data_type>


Structs

Structs in Hive group together named fields, each of which can carry an optional comment.

Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
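A hedged sketch of how these complex types might appear in a table definition (the table name employees and its columns are illustrative assumptions):

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';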

File Formats:

Some special file formats that Hive can handle are available, such as:

• Text File Format


• Sequence File Format
• RC File (Row column file format)
• Avro Files
• ORC Files (Optimized Row Columnar file format)
• Parquet
• Custom INPUTFORMAT and OUTPUTFORMAT

Text File Format: This is a simple format where data is stored as plain text files. Each line
typically represents a record, and fields within the record are separated by delimiters such as
commas or tabs.

Sequence File Format: This is a binary file format optimized for storing key-value pairs. It's
commonly used in Hadoop MapReduce jobs and is efficient for large-scale data processing.

RC File (Row Column File Format): RC File is a columnar storage format designed to
optimize query performance by storing data in columnar fashion, making it suitable for
analytics and data warehousing workloads.

Avro Files: Avro is a data serialization system that provides rich data structures and compact
binary formats. Avro files are self-describing and support schema evolution, making them
suitable for data interchange and long-term storage.

ORC Files (Optimized Row Columnar File Format): ORC is another columnar storage
format developed specifically for Hive. It offers advanced compression techniques and
improved performance for complex queries, making it ideal for analytics and data warehousing
use cases.

Parquet: Parquet is a columnar storage format that is highly optimized for efficient data
storage and query performance. It supports advanced features such as nested data structures,
predicate pushdown, and column pruning, making it suitable for a wide range of analytical
workloads.

Custom INPUTFORMAT and OUTPUTFORMAT: Hive allows users to define custom


InputFormat and OutputFormat classes to handle specialized file formats or data sources that
are not natively supported. This flexibility enables integration with various external systems
and data formats.
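A hedged sketch of how the file format is chosen at table-creation time (the table names are assumptions); data can also be copied from a text-format table into an ORC-format one:

-- plain text storage
CREATE TABLE logs_text (ts STRING, level STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- columnar ORC storage, typically better suited to analytical queries
CREATE TABLE logs_orc (ts STRING, level STRING, message STRING)
STORED AS ORC;

-- populate the ORC table from the text table
INSERT INTO TABLE logs_orc SELECT * FROM logs_text;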
HiveQL data definition

Hive DDL commands are the statements used for defining and changing the structure of a
table or database in Hive. It is used to build or modify the tables and other objects in the
database.

The several types of Hive DDL commands are:

1. CREATE
2. SHOW
3. DESCRIBE
4. USE
5. DROP
6. ALTER
7. TRUNCATE

1. CREATE DATABASE in Hive

The CREATE DATABASE statement is used to create a database in Hive. The keywords
DATABASE and SCHEMA are interchangeable; we can use either of them.

Syntax:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name


[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

2. SHOW DATABASE in Hive

The SHOW DATABASES statement lists all the databases present in the Hive.

Syntax:

SHOW (DATABASES|SCHEMAS);
3. DESCRIBE DATABASE in Hive

The DESCRIBE DATABASE statement in Hive shows the name of the database, its
comment (if set), and its location on the file system.

The EXTENDED keyword can be used to get the database properties as well.


Syntax:

DESCRIBE DATABASE/SCHEMA [EXTENDED] db_name;


4. USE DATABASE in Hive

The USE statement in Hive is used to select the specific database for a session on which all
subsequent HiveQL statements would be executed.

Syntax:

USE database_name;
5. DROP DATABASE in Hive

The DROP DATABASE statement in Hive is used to Drop (delete) the database.

The default behavior is RESTRICT which means that the database is dropped only when it is
empty. To drop the database with tables, we can use CASCADE.

Syntax:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];


6. ALTER DATABASE in Hive

The ALTER DATABASE statement in Hive is used to change the metadata associated with
the database in Hive.

Syntax for changing Database Properties:

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES


(property_name=property_value, ...);
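A hedged end-to-end sketch of these DDL statements (the database name university, the table student, and the property values are illustrative assumptions):

CREATE DATABASE IF NOT EXISTS university
COMMENT 'Student records'
WITH DBPROPERTIES ('owner' = 'admin');

SHOW DATABASES;
DESCRIBE DATABASE EXTENDED university;

USE university;
CREATE TABLE IF NOT EXISTS student (id INT, name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

ALTER DATABASE university SET DBPROPERTIES ('owner' = 'registrar');

-- RESTRICT (the default) fails if the database still contains tables; CASCADE drops them too
DROP DATABASE IF EXISTS university CASCADE;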

HiveQL data manipulation

Apache Hive DML stands for (Data Manipulation Language) which is used to insert, update,
delete, and fetch data from Hive tables. Using DML commands we can load files into Apache
Hive tables, write data into the filesystem from Hive queries, perform merge operation on the
table, and so on.

The following list of DML statements is supported by Apache Hive.

• LOAD

• SELECT

• INSERT
• DELETE

• UPDATE

• EXPORT

• IMPORT

LOAD: The LOAD statement is used to load data from external files (e.g., in HDFS) into a
Hive table. It's more of a data loading operation rather than a traditional DML statement. For
example:

LOAD DATA LOCAL INPATH '/path/to/data/file' INTO TABLE mytable;

• SELECT: The SELECT statement is used for querying data from Hive tables. It retrieves
data and does not modify the underlying data.

• INSERT: Hive provides an INSERT INTO statement to insert data into tables. However, it
appends data to existing data rather than performing updates or modifications. It's more of an
insert operation than a traditional SQL INSERT statement:

INSERT INTO mytable VALUES (1, 'John', 25);

• DELETE: Hive does support a DELETE statement for deleting rows from a table, but this
operation is limited to ACID-compliant Hive tables (e.g., ORC or Parquet) with transactions
enabled. It's not widely used in typical Hive workloads.

• UPDATE: Hive traditionally does not support a standard SQL UPDATE statement to modify
data in existing rows. Instead, you often need to work with Hive's batch processing model
and recreate tables with updated data.

• EXPORT and IMPORT: These are HiveQL extensions for exporting data from Hive
tables into external storage (EXPORT) or importing data from external storage into Hive
tables (IMPORT). They are not standard SQL DML statements.
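A hedged sketch combining several of these statements (the paths and table names are assumptions; UPDATE and DELETE would additionally require an ACID-enabled transactional table):

-- load a local file into a table
LOAD DATA LOCAL INPATH '/tmp/students.csv' INTO TABLE student;

-- append rows selected from another table
INSERT INTO TABLE student SELECT id, name, city FROM staging_student;

-- export a table (data plus metadata) to HDFS and re-import it under a new name
EXPORT TABLE student TO '/tmp/student_backup';
IMPORT TABLE student_copy FROM '/tmp/student_backup';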

HiveQL queries

In Apache Hive, HiveQL (Hive Query Language) is used to query data stored in Hive tables.
HiveQL provides a SQL-like syntax for querying structured and semi-structured data. Here
are some commonly used query types in HiveQL:

1. SELECT:
o Used to retrieve data from one or more tables.

Syntax:

SELECT column1 [, column2, ...]


FROM table_name
[JOIN table2_name ON join_condition]
[WHERE condition]
[GROUP BY column1 [, column2, ...]]
[HAVING condition]
[ORDER BY column1 [ASC|DESC] [, column2 [ASC|DESC], ...]]
[LIMIT n];

JOIN:

• Used to combine records from two or more tables based on a common column.

Syntax:

SELECT column1 [, column2, ...]


FROM table1_name
JOIN table2_name ON join_condition;

GROUP BY:

Used to group rows based on a specified column or columns.

Syntax:

SELECT column1 [, column2, ...], aggregate_function(column)


FROM table_name
GROUP BY column1 [, column2, ...];

HAVING:

Used to filter the result set based on aggregate function results.

Syntax:

SELECT column1 [, column2, ...], aggregate_function(column)


FROM table_name
GROUP BY column1 [, column2, ...]
HAVING condition;

ORDER BY:

Used to sort the result set based on one or more columns.


Syntax:

SELECT column1 [, column2, ...]


FROM table_name
ORDER BY column1 [ASC|DESC] [, column2 [ASC|DESC], ...];

LIMIT:

Used to restrict the number of rows returned in the result set.

Syntax:

SELECT column1 [, column2, ...]


FROM table_name
LIMIT n;

Subqueries:

Used to nest one query inside another query.

Syntax:

SELECT column1 [, column2, ...]


FROM table_name
WHERE column IN (SELECT column FROM table2_name WHERE condition);

Conditional Expressions:

Used to apply conditional logic in queries.

Syntax:

SELECT column1 [, column2, ...]


FROM table_name
WHERE column = value
OR column <> value
AND column > value
...
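Putting these clauses together, a hedged worked example (the table employee_details and its columns are assumptions made for illustration) might look like this:

-- average salary per city, keeping only cities whose average exceeds 50000
SELECT city, AVG(salary) AS avg_salary
FROM employee_details
WHERE salary > 10000
GROUP BY city
HAVING AVG(salary) > 50000
ORDER BY avg_salary DESC
LIMIT 5;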

These are some of the common query types in HiveQL. They allow you to retrieve, filter,
aggregate, and sort data from Hive tables. HiveQL supports various SQL-like constructs,
making it easier to work with structured data in Hive.
