
UNIT – 3 HDFS, HIVE AND HIVEQL, HBASE

STRUCTURE

3.0 Learning Objectives

3.1 Introduction

3.2 Overview of HDFS

3.2.1 Installation

3.2.2 Shell

3.2.3 Java API

3.3 Hive

3.3.1 Architecture

3.3.2 Installation

3.3.3 Comparison with Traditional Database

3.4 HiveQL

3.4.1 Querying Data

3.4.2 Sorting

3.4.3 Aggregating

3.4.4 Map Reduce Scripts

3.4.5 Joins

3.4.6 Sub queries

3.5 Concepts of HBase

3.5.1 Advanced Usage

3.5.2 Schema Design

3.5.3 Advance Indexing

3.6 PIG

3.7 Zookeeper

3.7.1 How it helps in monitoring a cluster?

3.7.2 Uses of HBase in Zookeeper

3.7.3 How to Build Applications with Zookeeper?

3.8 Distinguish between HDFS and HBase

3.9 Summary

3.10 Keywords

3.11 Learning Activity

3.12 Unit End Questions

3.13 References

3.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:

 Describe HDFS

 Define Hive

 Explain HiveQL

 Elucidate HBase and its use with ZooKeeper

 Describe Pig

3.1 INTRODUCTION

Thanks to big data, Hadoop has become a familiar term and has found its prominence in today's digital world. When anyone can generate massive amounts of data with just one click, the Hadoop framework is vital. Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers.

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Back in the day, data generation was limited. Hence, storing and processing data was done with a single storage unit and a single processor. In the blink of an eye, data generation increased by leaps and bounds, not only in volume but also in variety: data can be structured, semi-structured, or unstructured. A single processor was therefore incapable of processing high volumes of such varied data. Multiple machines help process data in parallel, but the storage unit then became a bottleneck, resulting in network overhead. To address this issue, the storage is distributed amongst the processors. This distribution allows data to be stored and accessed efficiently with no network overhead, and it is how data engineers and analysts manage big data effectively.

The core technique of storing files lies in the file system used by the operating environment. Unlike common file systems, Hadoop uses a different file system that deals with large datasets across a distributed network: the Hadoop Distributed File System (HDFS). This unit introduces the idea, with related background information to begin with.

Hadoop Distributed File System – HDFS is the world’s most reliable storage system. HDFS is the file system of Hadoop, designed for storing very large files on a cluster of commodity hardware. It is designed on the principle of storing a small number of large files rather than a huge number of small files.

Hadoop HDFS provides a fault-tolerant storage layer for Hadoop and its other components.
HDFS Replication of data helps us to attain this feature. It stores data reliably, even in the
case of hardware failure. It provides high throughput access to application data by providing
the data access in parallel.

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage using SQL. Structure can be projected onto data
already in storage. A command line tool and JDBC driver are provided to connect users to
Hive.

This workbook contains some practical exercises for researchers and data analysts who want to run simple queries using Apache Hive. The queries in this document are the ones which were used as part of the ‘What is Hive?’ webinar. A few of the simpler queries, which were repeated for different tables, have been omitted for brevity.

3.2 OVERVIEW OF HDFS

The normal file system was designed to work on a single machine or single operating environment. The datasets in Hadoop require storage capacity beyond what a single physical machine can provide. Therefore, it becomes imperative to partition data across a number of machines. This requires a special process to manage the files across the distributed network. HDFS is the file system that specifically addresses this issue. This file system is more complex than a regular file system because it has to deal with network programming, fragmentation, fault tolerance, compatibility with the local file system, and so forth. It empowers Hadoop to run Big Data applications across multiple servers. It is characterized by being highly fault tolerant with high data throughput across low-cost hardware. The objectives of the HDFS file system are as follows:

 To deal with very large files

 To provide streaming data access to the file system, following a write once, read many times pattern

 To run on inexpensive commodity hardware

By design, HDFS trades away other capabilities in return: it is not optimized for low-latency data access, for a massive number of small files, or for multiple concurrent writers making arbitrary file modifications.

The smallest amount of data that can be read from or written to a disk is called the block size. Typically, a disk block is 512 bytes and file system blocks are a few kilobytes. HDFS works on the same principle, but its blocks are much larger (128 MB by default). The larger block size minimizes seeks and therefore the cost of reading data. Blocks are distributed throughout the cluster: each file is split into blocks, and copies of each block are stored on different servers in the network, so the blocks of an individual file are replicated across servers in the cluster.

There are two types of nodes operating in the cluster in a master-slave pattern. The master node is called the name node and the worker nodes are called data nodes. It is through these nodes that HDFS maintains the file (and directory) system tree and metadata. A file is split into blocks and stored in a subset of data nodes spread across the cluster. The data nodes are responsible for read, write, block creation, deletion, and replication requests in the file system.

The name node, on the other hand, is the server that monitors access to the file system and maintains the metadata of the data files in HDFS. It maps blocks to data nodes and handles file/directory open, close, and rename requests.

Data nodes are the core part of the file system and do the job of storing and retrieving blocks at the request of clients. The name node is the maintainer to which data nodes report. This means that if the name node is lost, the information about the files would be lost. Therefore, Hadoop makes sure that the name node is resilient enough to withstand any kind of failure. One technique to ensure that is to back it up in a secondary name node by periodically merging the namespace image with the edit log. The secondary name node usually resides on a separate machine to take over as the primary name node in case of a major failure.

There are many ways to interact with the HDFS file system, but the command line interface
is perhaps the simplest and most common. Hadoop can be installed onto one machine and run
to get a first-hand taste of it.

The HDFS file system operations are quite similar to the normal file system operations. Here
are some listings just to give an idea.

 Copies files from the local file system to HDFS

 Creates a directory in HDFS

 Lists files and directories in the current working directory in HDFS
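For example, the corresponding commands look like this (the file and directory names are illustrative):

hadoop fs -mkdir /user/demo

hadoop fs -copyFromLocal data.txt /user/demo

hadoop fs -ls /user/demo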

There are five main advantages to using HDFS, including:

Cost Effectiveness

The Data Nodes that store the data rely on inexpensive off-the-shelf hardware, which cuts
storage costs. Also, because HDFS is open source, there's no licensing fee.

Large Data Set Storage

HDFS stores a variety of data of any size from megabytes to petabytes and in any format,
including structured and unstructured data.

Fast Recovery from Hardware Failure

HDFS is designed to detect faults and automatically recover on its own.

Portability

HDFS is portable across all hardware platforms, and it is compatible with several operating
systems, including Windows, Linux, and Mac OS/X.

Streaming data access

HDFS is built for high data throughput, which is best for access to streaming data.

3.2.1 Installation

Here you will learn how to successfully install Hadoop and configure the clusters which
could range from just a couple of nodes to even tens of thousands over huge clusters. So, for
that, first you need to install Hadoop on a single machine. The requirement for that is you
need to install Java if you don’t have it already on your system.

Getting Hadoop to work on the entire cluster involves getting the required software on all the
machines that are tied to the cluster. As per the norms one of the machines is associated with
the Name Node and another is associated with the Resource Manager.

The other services, like the Map Reduce Job History Server and the Web App Proxy Server, can be hosted on specific machines or even on shared resources as per the requirement of the task or load. All the other nodes in the entire cluster will have the dual nature of being both a Node Manager and a Data Node. These are collectively termed the slave nodes.

There are two ways to install Hadoop, i.e., Single node and Multi node.

A single-node cluster means only one Data Node is running, with the Name Node, Data Node, Resource Manager and Node Manager all set up on a single machine. This is used for studying and testing purposes. For example, let us consider a sample data set inside a healthcare industry.

So, for testing whether the Oozie jobs have scheduled all the processes like collecting, aggregating, storing, and processing the data in a proper sequence, we use a single node cluster. It can easily and efficiently test the sequential workflow in a smaller environment as compared to large environments which contain terabytes of data distributed across hundreds of machines.

In a multi node cluster, there is more than one Data Node running, and each Data Node runs on a different machine. The multi node cluster is practically used in organizations for analysing Big Data.

Considering the above example, in real time when we deal with petabytes of data, it needs to
be distributed across hundreds of machines to be processed. Thus, here we use multi node
cluster.

Step 1

Download the Java 8 package (e.g., jdk-8u101-linux-i586.tar.gz) and save the file in your home directory.

Step 2

Extract the Java Tar File.

Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Fig 3.1 Hadoop installation – extracting java files…

Step 3

Download the Hadoop 2.7.3 Package.

Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Fig 3.2 Hadoop installation – downloading Hadoop…

Step 4

Extract the Hadoop tar File.

Command: tar -xvf hadoop-2.7.3.tar.gz

Fig 3.3 Hadoop installation – extracting hadoop files…

Step 5

Add the Hadoop and Java paths in the bash file (.bashrc).

Open the .bashrc file and add the Hadoop and Java paths as shown below.

Command: vi .bashrc

Fig 3.4 Hadoop installation – setting environment variable…
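The entries added to .bashrc typically look like the following (the install paths are illustrative and should match where Java and Hadoop were extracted):

export JAVA_HOME=/home/user/jdk1.8.0_101

export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_HOME=/home/user/hadoop-2.7.3

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin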

Then, save the bash file and close it.

For applying all these changes to the current Terminal, execute the source command.

Command: source .bashrc

Fig 3.5 Hadoop installation – refreshing environment variables…

To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and hadoop version commands.

Command: java -version

Fig 3.6 Hadoop installation – checking java version

Command: hadoop version

Fig 3.7 Hadoop installation – checking Hadoop version...

Step 6

Edit the Hadoop Configuration files.

Command: cd hadoop-2.7.3/etc/hadoop/

Command: ls

All the Hadoop configuration files are located in the hadoop-2.7.3/etc/hadoop directory, as you can see in the snapshot below:

Fig 3.8 Hadoop installation – Hadoop configuration files…

Step 7

Open core-site.xml and edit the property mentioned below inside configuration tag:

core-site.xml informs Hadoop daemon where Name Node runs in the cluster. It contains
configuration settings of Hadoop core such as I/O settings that are common to HDFS & Map
Reduce.

Command: vi core-site.xml

Fig 3.9 Hadoop installation – configuring core-site.xml…
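A typical single-node value for this property looks like the following (the host and port are illustrative):

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>

</property>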

Step 8

Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:

hdfs-site.xml contains configuration settings of HDFS daemons (i.e., Name Node, Data
Node, and Secondary Name Node). It also includes the replication factor and block size of
HDFS.

Command: vi hdfs-site.xml

Fig 3.10 Hadoop installation – configuring hdfs-site.xml…
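For a single-node setup, the replication factor is typically set to 1:

<property>

<name>dfs.replication</name>

<value>1</value>

</property>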

Step 9

Edit the mapred-site.xml file and edit the property mentioned below inside configuration tag:

Mapred-site.xml contains configuration settings of Map Reduce application like number of


JVM that can run in parallel, the size of the mapper and the reducer process, CPU cores
available for a process, etc.

In some cases, mapred-site.xml file is not available. So, we have to create the mapred-
site.xml file using mapred-site.xml template.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml

Fig 3.11 Hadoop installation – configuring mapred-site.xml….
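On Hadoop 2.x the essential property here tells Map Reduce to run on YARN:

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>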

Step 10

Edit yarn-site.xml and edit the property mentioned below inside configuration tag:

Yarn-site.xml contains configuration settings of Resource Manager and Node Manager like
application memory management size, the operation needed on program & algorithm, etc.

Command: vi yarn-site.xml

Fig 3.12 Hadoop installation – configuring yarn-site.xml…
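A minimal setting enables the Map Reduce shuffle auxiliary service on the Node Manager:

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>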

Step 11

Edit hadoop-env.sh and add the Java Path as mentioned below:

hadoop-env.sh contains the environment variables that are used in the script to run Hadoop
like Java home path, etc.

Command: vi hadoop-env.sh

Fig 3.13 Hadoop installation – configuring hadoop-env.sh…
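The essential line sets JAVA_HOME; the JDK path below is illustrative and should match where the JDK was extracted in Step 2:

export JAVA_HOME=/home/user/jdk1.8.0_101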

Step 12

Go to Hadoop home directory and format the Name Node.

Command: cd

Command: cd hadoop-2.7.3

Command: bin/hadoop namenode -format

Fig 3.14 Hadoop installation – formatting Name Node...

This formats HDFS via the Name Node. This command is only executed the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable.

Never format an up-and-running Hadoop file system; you will lose all the data stored in HDFS.

Step 13

Once the Name Node is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons.

Command: cd hadoop-2.7.3/sbin

Either you can start all daemons with a single command or do it individually.

Command: ./start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh & mr-jobhistory-daemon.sh

Or you can run all the services individually as below:

 Start Name Node

The Name Node is the centrepiece of an HDFS file system. It keeps the directory tree of all files stored in the HDFS and tracks all the files stored across the cluster.

Command: ./hadoop-daemon.sh start namenode

Fig 3.15 Hadoop installation – starting Name Node

 Start Data Node:

On start-up, a Data Node connects to the Name node and it responds to the requests from
the Name node for different operations.

Command: ./hadoop-daemon.sh start datanode

Fig 3.16 Hadoop installation – starting Data Node

 Start Resource Manager:

Resource Manager is the master that arbitrates all the available cluster resources and thus helps in managing the distributed applications running on the YARN system. Its work is to manage each Node Manager and each application’s Application Master.

Command: ./yarn-daemon.sh start resourcemanager

Fig 3.17 Hadoop installation – starting Resource Manager

 Start Node Manager:

The Node Manager in each machine framework is the agent, which is responsible for
managing containers, monitoring their resource usage, and reporting the same to the
Resource Manager.

Command: ./yarn-daemon.sh start nodemanager

Fig 3.18 Hadoop installation – starting Node Manager

 Start JobHistoryServer:

JobHistoryServer is responsible for servicing all job-history-related requests from clients.

Command: ./mr-jobhistory-daemon.sh start historyserver

Step 14

To check that all the Hadoop services are up and running, run the below command.

Command: jps

Fig 3.19 Hadoop installation – checking daemons

Step 15

Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the Name Node interface.

Fig 3.20 Check the Name Node interface.

There are definite advantages to installing and working with the open-source version of
Hadoop, even if you don’t actually use its Map Reduce features. For instance, if you really
want to understand the concept of map and reduce, learning how Hadoop does it will give
you a fairly deep understanding of it. You’ll also find that if you are running a Spark job,
putting data in the Hadoop Distributed File System, and giving your Spark workers access to
HDFS, can come in handy.

3.2.2 Shell

This section covers the most important operations of the Hadoop Distributed File System using the shell commands that are used for file management in the cluster. HDFS allows user data to be organized in the form of files and directories. It provides a command line interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g., bash, csh) that users are already familiar with.

There are three possible use cases:

 Running an interactive UI shell in which the user types commands

 Launching the shell with a specific HDFS command

 Running in daemon mode – communication using UNIX domain sockets

Advantages of the UI over Directly Calling HDFS DFS Functions

 HDFS DFS initiates JVM for each command call, HDFS Shell does it only once - which
means great speed enhancement when you need to work with HDFS more often

 Commands can be used in short way - eg. hdfs dfs -ls /, ls / - both will work

 HDFS path completion using TAB key

 You can easily add any other HDFS manipulation function

 There is a command history persisting in history log (~/.hdfs-shell/hdfs-shell.log)

 Support for relative directory + commands cd and pwd

 Advanced commands like su, groups, whoami

 Customizable shell prompt

Disadvantages of the UI over Directly Calling HDFS DFS Functions

Commands cannot be piped, eg: calling ls /analytics | less is not possible at this time, you
have to use HDFS Shell in Daemon mode.

Configuring Launch Script(S) for your Environment

HDFS-Shell is a standard Java application. For its launch you need to define 2 things on your
class path:

 All ./lib/*.jar on the class path (the dependencies; ./lib is included in the binary bundle, or located in the Gradle build/distributions/*.zip)

 The path to the directory with your Hadoop cluster config files (hdfs-site.xml, core-site.xml etc.) - without these files the HDFS Shell will work in local file system mode

 on Linux it is usually located in the /etc/hadoop/conf folder

 on Windows it is usually located in the %HADOOP_HOME%\etc\hadoop\ folder

Note that paths inside the java -cp switch are separated by ':' on Linux and ';' on Windows.

Pre-defined launch scripts are located in the zip file. You can modify it locally as needed.

 for CLI UI run hdfs-shell.sh (without parameters) otherwise:

 HDFS Shell can be launched directly with the command to execute - after completion,
hdfs-shell will exit

 launch HDFS Shell with hdfs-shell.sh script <file_path> to execute commands from a file

 launch HDFS Shell with hdfs-shell.sh xscript <file_path> to execute commands from a file but ignore command errors (skip errors)

Possible Commands inside Shell

 Type help to get a list of all supported commands

 Type clear or cls to clear the screen

 Type exit or quit or just q to exit the shell

 To call a system command, type ! <command>, e.g., !echo hello will call the system command echo

 Type an (hdfs) command only, without any parameters, to get its parameter description, e.g., ls only

 script <file_path> to execute commands from a file

 xscript <file_path> to execute commands from a file but ignore command errors (skip errors)

Additional Commands

For our purposes we also integrated following commands:

 set showresultcodeon and set showresultcodeoff - if enabled, the command result code is written after its completion

 cd, pwd

 su <username> - experimental - changes the current active user - it probably won't work on secured HDFS (Kerberos)

 whoami - prints the effective username

 groups <username1 <username2,>> - e.g., groups hdfs prints the groups for the given users, same as the hdfs groups my_user my_user2 functionality

 edit 'my file' - see the Edit Command section below

Edit Command

Since the version 1.0.4 the simple command 'edit' is available. The command gets selected
file from HDFS to the local temporary directory and launches the editor. Once the editor
saves the file (with a result code 0), the file is uploaded back into HDFS (target file is
overwritten). By default, the editor path is taken from $EDITOR environment variable. If
$EDITOR is not set, vim (Linux, Mac) or notepad.exe (Windows) is used.

How to Change Command (Shell) Prompt

HDFS Shell supports a customized bash-like prompt. The standard bash prompt switches are supported (including colors), excluding \! and \#, and an online bash prompt generator can be used to create the prompt value of your wish. To set up your favourite prompt, simply add export HDFS_SHELL_PROMPT="value" to your .bashrc (or set the environment variable on Windows) and that's it. Restart HDFS Shell to apply the change. The default value is currently set to \e[36m\u@\h \e[0;39m\e[33m\w\e[0;39m\e[36m\\$ \e[37;0;39m.

FS Shell Commands

The hadoop fs command runs a generic file system user client that interacts with the Hadoop file system (HDFS).

View file listings

Usage: hadoop fs -ls hdfs:/

Check Memory Status

Usage: hadoop fs -df hdfs:/

Count of Directories, Files and Bytes in Specified Path and File Pattern

Usage: hadoop fs -count hdfs:/

Move File from One Location to Another

Usage: -mv <src> <dst>

Copy File from Source to Destination

Usage: -cp <src> <dst>

Delete File

Usage: -rm <path>

Put File from the Local file System to Hadoop Distributed File System:

Usage: -put <localsrc> … <dst>

Copy File from Local to HDFS

Usage: -copyFromLocal <localsrc> … <dst>

View File in Hadoop Distributed File System

Usage: -cat <src>
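A short worked sequence tying these commands together (the paths and file names are illustrative):

hadoop fs -mkdir /user/demo

hadoop fs -put report.csv /user/demo

hadoop fs -ls /user/demo

hadoop fs -cat /user/demo/report.csv

hadoop fs -cp /user/demo/report.csv /user/demo/report_copy.csv

hadoop fs -rm /user/demo/report_copy.csv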

3.2.3 Java API

Java is an object-oriented programming language that runs on almost all electronic devices.
Java is platform-independent because of Java virtual machines (JVMs). It follows the
principle of "write once, run everywhere.” When a JVM is installed on the host operating
system, it automatically adapts to the environment and executes the program’s functionalities.
API stands for application program interface. A programmer writing an application program
can make a request to the Operating System using API (using graphical user interface or
command interface).

It is a set of routines, protocols and tools for building software and applications. It may be
any type of system like a web-based system, operating-system, or a database System.

APIs are important software components bundled with the JDK. APIs in Java include classes,
interfaces, and user Interfaces. They enable developers to integrate various applications and
websites and offer real-time information.

Types of Java APIs

There are four types of APIs in Java:

Public

Public (or open) APIs are Java APIs that come with the JDK. They do not have strict
restrictions about how developers use them.

Private

Private (or internal) APIs are developed by a specific organization and are accessible to only
employees who work for that organization.

Partner

Partner APIs are considered to be third-party APIs and are developed by organizations for
strategic business operations.

Composite

Composite APIs are microservices, and developers build them by combining several service
APIs.

Usage

 API in Procedural languages

It specifies a set of Functions and Routines that helps in completing a task.

 API in object-oriented languages

It simply shows how an object works in a given object-oriented language. It is expressed as a set of classes with an associated list of class methods.

 API in libraries and Frameworks

It is related to Software libraries and Software Framework.

 API and Protocols

It can be implementation of Protocols.

 API sharing and reuse via virtual machine

Languages running on virtual machines can share an API.
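Since this subsection sits under HDFS, the following is a minimal sketch of what the HDFS Java API itself looks like: a small program that reads a file from HDFS and prints it to standard output. The HDFS path is illustrative, and the configuration is picked up from core-site.xml and hdfs-site.xml on the class path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);               // connects to the file system named in fs.defaultFS
        Path file = new Path("/user/demo/sample.txt");      // illustrative HDFS path
        FSDataInputStream in = fs.open(file);               // open the file for reading
        try {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream the contents to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}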

3.3 HIVE

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. A data warehouse provides a central store of information that can easily be
analysed to make informed, data driven decisions. Hive allows users to read, write, and
manage petabytes of data using SQL.

Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work quickly on petabytes of data. What makes Hive unique is the ability to query large datasets, leveraging Apache Tez or Map Reduce, with a SQL-like interface.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it up
and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic Map Reduce.

Benefits of Hive

 FAST

Hive is designed to quickly handle petabytes of data using batch processing.

 FAMILIAR

Hive provides a familiar, SQL-like interface that is accessible to non-programmers.

 SCALABLE

Hive is easy to distribute, and scale based on your needs.

Hive is not

 A relational database

 A design for Online Transaction Processing (OLTP)

 A language for real-time queries and row-level updates

Features of Hive

 It stores schema in a database and processed data into HDFS.

 It is designed for OLAP.

 It provides SQL type language for querying called HiveQL or HQL.

 It is familiar, fast, scalable, and extensible

3.3.1 Architecture

The following architecture explains the flow of submission of query into Hive.

Fig 3.21 Flow of submission of query into hive

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++.
It supports different types of clients such as

 Thrift Server - It is a cross-language service provider platform that serves the request
from all those programming languages that supports Thrift.

 JDBC Driver - It is used to establish a connection between hive and Java applications.
The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.

 ODBC Driver - It allows the applications that support the ODBC protocol to connect to
Hive.

Hive Services

The following are the services provided by Hive

 Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.

 Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.

 Hive MetaStore - It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata of columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.

 Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.

 Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.

 Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements
into Map Reduce jobs.

 Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of
map-reduce tasks and HDFS tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.

3.3.2 Installation

Prerequisites to successfully perform Hive Installation

Before you start the process of installing and configuring Hive, it is necessary to have the
following tools available in your local environment.

If not, you will need to have the below software for Hive to be working appropriately.

 Java

 Hadoop

 Yarn

 Apache Derby

Verifying JAVA Installation

Java must be installed on your system before installing Hive. Let us verify java installation
using the following command:

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)

Java Hotspot(TM) Client VM (build 25.0-b02, mixed mode)

Verifying Hadoop Installation

Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop
installation using the following command:

$ hadoop version

If Hadoop is already installed on your system, then you will get the following response:

Hadoop 2.4.1 Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768

Compiled by hortonmu on 2013-10-07T06:28Z

Compiled with protoc 2.5.0

From source with checksum 79e53ce7994d1628b240f09af91e1af4

Downloading Hive

We use hive-0.14.0 in this tutorial. You can download it by visiting the following link
http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the
/Downloads directory. Here, we download Hive archive named “apache-hive-0.14.0-
bin.tar.gz” for this tutorial. The following command is used to verify the download:

$ cd Downloads

$ ls

On successful download, you get to see the following response:

apache-hive-0.14.0-bin.tar.gz

Installing Hive

The following steps are required for installing Hive on your system. Let us assume the Hive
archive is downloaded onto the /Downloads directory.

 Extracting and verifying Hive Archive

The following command is used to verify the download and extract the hive archive:

$ tar zxvf apache-hive-0.14.0-bin.tar.gz

$ ls

On successful download, you get to see the following response:

apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

 Copying files to /usr/local/hive directory

We need to copy the files from the super user “su -”. The following commands are used
to copy the files from the extracted directory to the /usr/local/hive” directory.

$ su -

passwd:

# cd /home/user/Download

# mv apache-hive-0.14.0-bin /usr/local/hive

# exit

 Setting up environment for Hive

You can set up the Hive environment by appending the following lines to ~/.bashrc file:

export HIVE_HOME=/usr/local/hive

export PATH=$PATH:$HIVE_HOME/bin

export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:

export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:

The following command is used to execute ~/.bashrc file.

$ source ~/.bashrc

Configuring Hive

To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in the
$HIVE_HOME/conf directory. The following commands redirect to Hive config folder and
copy the template file:

$ cd $HIVE_HOME/conf

$ cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh file by appending the following line:

export HADOOP_HOME=/usr/local/hadoop

Hive installation is completed successfully. Now you require an external database server
to configure Metastore. We use Apache Derby database.

Verifying Derby Archive

The following commands are used for extracting and verifying the Derby archive:

$ tar zxvf db-derby-10.4.2.0-bin.tar.gz

$ ls

On successful download, you get to see the following response:

db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Configuring Metastore of Hive

Configuring Metastore means specifying to Hive where the database is stored. You can do
this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of
all, copy the template file using the following command:

$ cd $HIVE_HOME/conf

$ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml and append the following lines between the <configuration> and
</configuration> tags:

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>

<description>JDBC connect string for a JDBC metastore </description>

</property>

Create a file named jpox.properties and add the following lines into it:

javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl

org.jpox.autoCreateSchema = false

org.jpox.validateTables = false

org.jpox.validateColumns = false

org.jpox.validateConstraints = false

org.jpox.storeManagerType = rdbms

org.jpox.autoCreateSchema = true

org.jpox.autoStartMechanismMode = checked

org.jpox.transactionIsolation = read_committed

javax.jdo.option.DetachAllOnCommit = true

javax.jdo.option.NontransactionalRead = true

javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver

javax.jdo.option.ConnectionURL =
jdbc:derby://hadoop1:1527/metastore_db;create = true

javax.jdo.option.ConnectionUserName = APP

javax.jdo.option.ConnectionPassword = mine

Verifying Hive Installation

Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS.
Here, we use the /user/hive/warehouse folder. You need to set write permission for these
newly created folders as shown below:

chmod g+w

Now set them in HDFS before verifying Hive. Use the following commands:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

The following commands are used to verify Hive installation:

$ cd $HIVE_HOME

$ bin/hive

On successful installation of Hive, you get to see the following response:

Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties

Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt

………………….

hive>

The following sample command is executed to display all the tables:

hive> show tables;

OK

Time taken: 2.798 seconds

hive>

3.3.3 Comparison with Traditional Database

Hive vs. a traditional database:

 Schema: Hive uses schema on read – the schema is not verified while the data is being loaded. A traditional database uses schema on write – the table schema is enforced at data load time, and data that does not conform to the schema is rejected.

 Scalability: Hive is very easily scalable at low cost. A traditional database is not as scalable, and scaling up is costly.

 Access pattern: Hive is based on the Hadoop notion of write once, read many times. In a traditional database we can read and write many times.

 Updates: Record-level updates are not possible in Hive. In a traditional database, record-level updates, insertions and deletes, transactions and indexes are possible.

 Workloads: OLTP (On-line Transaction Processing) is not yet supported in Hive, but OLAP (On-line Analytical Processing) is. Both OLTP and OLAP are supported in an RDBMS.

Table 3.1 Difference between Hive and a traditional database

3.4 HIVEQL

Hive is a data warehouse infrastructure and supports analysis of large datasets stored in
Hadoop's HDFS and compatible file systems. It provides an SQL (Structured Query
Language) like language called Hive Query Language (HiveQL).

The Hive Query Language (HiveQL) is a query language for Hive to process and analyse
structured data in a Metastore. It filters the data using the condition and gives you a finite
result. The built-in operators and functions generate an expression, which fulfils the
condition.

In addition, Hive’s SQL-inspired language shields the user from the complexity of Map Reduce programming. It also reuses familiar concepts from the relational database world, such as tables, rows, columns, and schema, to ease learning.

Moreover, here most of the interactions tend to take place over a command line interface
(CLI).

However, there are four file formats which Hive supports. Such as TEXTFILE,
SEQUENCEFILE, ORC and RCFILE (Record Columnar File).

 Basically, for single-user metadata storage Hive uses the Derby database.

 Whereas Hive uses MySQL for the multiple-user or shared metadata case.

Features of HiveQL

 Being a high-level language, Hive queries are implicitly converted to map-reduce jobs or complex DAGs (directed acyclic graphs). Using the EXPLAIN keyword before the query, we can get the query plan (see the example after this list).

 Faster query execution using metadata storage in an RDBMS format; Hive also replicates data, making retrieval easy in case of loss.

 Bitmap indexing is done to accelerate queries.

 Enhances performance by allowing partitioning of data.

 Hive can process different types of compressed files, thus saving disk space.

 To manipulate strings, integers, or dates, HiveQL supports user-defined functions (UDFs), to solve problems not supported by the built-in functions.

 It provides a range of additional APIs, to build a customized query engine.

 Different file formats are supported like Text file, Sequence file, ORC (Optimised Row
Columnar), RCFile, Avro and Parquet. ORC file format is most suitable for improving
query performance as it stores data in the most optimized way, leading to faster query
execution.

 It is an efficient data analytics and ETL tool for large datasets.

 Queries are easy to write as the language is similar to SQL. DDL (Data Definition Language) commands in Hive are used to specify and change the structure of databases or tables. These commands are drop, create, truncate, alter, show, or describe.
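For example, prefixing a query with EXPLAIN prints the plan Hive generates instead of running the job (the table and column names here are illustrative):

hive> EXPLAIN SELECT dept, count(*) FROM employee_data GROUP BY dept;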

Limitations

 Hive queries have higher latency as Hadoop is a batch-oriented system.

 Nested or sub-queries are not supported.

 Update or delete or insert operation cannot be performed at a record level.

 Real-time data processing or querying is not offered through Hive.

With petabytes of data, ranging from billions to trillions of records, HiveQL has a large scope
for big data professionals.

Scope of HiveQL

Below is how the scope of HiveQL widens and better serves to analyse humungous data
generated by users every day.

 Security: Along with processing large data, Hive provides data security. This task is
complex for the distributed system, as multiple components are needed to communicate
with each other. Kerberos authorization support allows authentication between client and
server.

 Locking: Traditionally, Hive lacks locking on rows, columns, or queries. Hive can
leverage Apache Zookeeper for locking support.

 Workflow Management: Apache Oozie is a workflow scheduler to automate various HiveQL queries to execute sequentially or in parallel.

 Visualization: Zeppelin notebook is a web-based notebook which enables interactive data analytics. It supports Hive and Spark for data visualization and collaboration.

Conclusion

HiveQL is widely used across organizations to solve complex use cases. Keeping in mind the features and limitations offered by the language, the Hive query language is used in telecommunications, healthcare, retail, banking and financial services, and even NASA’s Jet Propulsion Laboratory’s climate evaluation system. Ease of writing SQL-like queries and commands accounts for its wider acceptance. The field’s growing job opportunities lure freshers and professionals from different sectors to gain hands-on experience and knowledge about the field.

3.4.1 Querying Data

A query is a question or a request for information expressed in a formal manner. In computer science, a query is essentially the same thing; the only difference is that the answer or retrieved information comes from a database.

A database query is either an action query or a select query. A select query is one that
retrieves data from a database. An action query asks for additional operations on data, such as
insertion, updating, deleting or other forms of data manipulation.

This doesn't mean that users just type in random requests. For a database to understand
demands, it must receive a query based on the predefined code. That code is a query
language.

Queries are one of the things that make databases so powerful. A "query" refers to the action
of retrieving data from your database. Usually, you will be selective with how much data you
want returned. If you have a lot of data in your database, you probably don't want to see
everything. More likely, you'll only want to see data that fits a certain criterion.

For example, you might only want to see how many individuals in your database live in a
given city. Or you might only want to see which individuals have registered with your
database within a given time period.

As with many other tasks, you can query a database either programmatically or via a user
interface.

Option 1: Programmatically

The way to retrieve data from your database with SQL is to use the SELECT statement.

Using the SELECT statement, you can retrieve all records...

SELECT * FROM Albums;

...or just some of the records:

SELECT * FROM Albums

WHERE ArtistId = 1;

The 2nd query only returns records where the value in the ArtistId column equals 1. So, if
there are say, three albums belonging to artist 1, then three records would be returned.

SQL is a powerful language, and the above statement is very simple. You can use SQL to choose which columns you want to display, you can add further criteria, and you can even query multiple tables at the same time. If you're interested in learning more about SQL, a dedicated SQL tutorial is worth working through after this unit.

Option 2: User Interface

You might find the user interface easier to generate your queries, especially if they are
complex.

Database management systems usually offer a "design view" for your queries. Design view
enables you to pick and choose which columns you want to display and what criteria you'd
like to use to filter the data.

3.4.2 Sorting

Sorting refers to arranging data in a particular format. Sorting algorithm specifies the way to
arrange data in a particular order. Most common orders are in numerical or lexicographical
order.

The importance of sorting lies in the fact that data searching can be optimized to a very high
level if data is stored in a sorted manner. Sorting is also used to represent data in more
readable formats. Following are some of the examples of sorting in real-life scenarios –

 Telephone Directory –

The telephone directory stores the telephone numbers of people sorted by their names, so
that the names can be searched easily.

 Dictionary –

The dictionary stores words in an alphabetical order so that searching of any word
becomes easy.

In-place Sorting and Not-in-place Sorting

Sorting algorithms may require some extra space for comparison and temporary storage of a few data elements. Algorithms that do not require any extra space, and instead sort within the array itself, are said to sort in-place. This is called in-place sorting. Bubble sort is an example of in-place sorting.

However, in some sorting algorithms, the program requires space which is more than or equal
to the elements being sorted. Sorting which uses equal or more space is called not-in-place
sorting. Merge-sort is an example of not-in-place sorting.

Stable and Not Stable Sorting

If a sorting algorithm, after sorting the contents, does not change the sequence of similar
content in which they appear, it is called stable sorting.

Fig 3. 22 Stable sorting

If a sorting algorithm, after sorting the contents, changes the sequence of similar content in
which they appear, it is called unstable sorting.

Fig 3. 23 Unstable sorting

Adaptive and Non-Adaptive Sorting Algorithm

A sorting algorithm is said to be adaptive if it takes advantage of already 'sorted' elements in the list that is to be sorted. That is, while sorting, if the source list has some elements already sorted, adaptive algorithms will take this into account and will try not to re-order them.

A non-adaptive algorithm is one which does not take into account the elements which are
already sorted. They try to force every single element to be re-ordered to confirm their
sortedness.

Important Terms

Some terms are generally coined while discussing sorting techniques, here is a brief
introduction to them −

 Increasing Order

A sequence of values is said to be in increasing order if the successive element is greater than the previous one. For example, 1, 3, 4, 6, 8, 9 are in increasing order, as every next element is greater than the previous element.

 Decreasing Order

A sequence of values is said to be in decreasing order if the successive element is less than the current one. For example, 9, 8, 6, 4, 3, 1 are in decreasing order, as every next element is less than the previous element.

 Non-Increasing Order

A sequence of values is said to be in non-increasing order if the successive element is less than or equal to its previous element in the sequence. This order occurs when the sequence contains duplicate values. For example, 9, 8, 6, 3, 3, 1 are in non-increasing order, as every next element is less than or equal to (in case of 3) but not greater than any previous element.

 Non-Decreasing Order

A sequence of values is said to be in non-decreasing order if the successive element is greater than or equal to its previous element in the sequence. This order occurs when the sequence contains duplicate values. For example, 1, 3, 3, 6, 8, 9 are in non-decreasing order, as every next element is greater than or equal to (in case of 3) but not less than the previous one.
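In HiveQL itself, sorting is expressed with ORDER BY (a total order over the result) or SORT BY (ordering within each reducer). A minimal sketch, reusing the employee_data table from the next subsection (the Id and Name columns are illustrative):

hive> SELECT Id, Name, Salary FROM employee_data ORDER BY Salary DESC;

hive> SELECT Id, Name, Salary FROM employee_data SORT BY Salary DESC;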

3.4.3 Aggregating

The Hive provides various in-built functions to perform mathematical and aggregate type
operations. In Hive, the aggregate function returns a single value resulting from computation
over many rows. Let’s see some commonly used aggregate functions: -

The commonly used aggregate functions, their return types, and what they do are listed below:

 count(*) – returns BIGINT; the count of the number of rows present in the file.

 sum(col) – returns DOUBLE; the sum of the values.

 sum(DISTINCT col) – returns DOUBLE; the sum of the distinct values.

 avg(col) – returns DOUBLE; the average of the values.

 avg(DISTINCT col) – returns DOUBLE; the average of the distinct values.

 min(col) – returns DOUBLE; compares the values and returns the minimum.

 max(col) – returns DOUBLE; compares the values and returns the maximum.

Fig 3.24 Commonly used aggregate functions

Examples of Aggregate Functions

Let's see an example to fetch the maximum salary of an employee.

hive> select max(Salary) from employee_data;

Fig 3.25 Example to fetch the maximum salary
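Aggregate functions are usually combined with GROUP BY. A sketch along the same lines, assuming employee_data also has a Department column (illustrative):

hive> SELECT Department, avg(Salary), count(*) FROM employee_data GROUP BY Department;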

3.4.4 Map Reduce Scripts

With the increasing popularity of big data applications, Map Reduce has become the standard
for performing batch processing on commodity hardware. However, Map Reduce code can be
quite challenging to write for developers, let alone data scientists and administrators.

Hive is a data warehousing framework that runs on top of Hadoop and provides an SQL
abstraction for Map Reduce apps. Data analysts and business intelligence officers need not
learn another complex programming language for writing Map Reduce apps. Hive will
automatically interpret any SQL query into a series of Map Reduce jobs.

Users can also plug in their own custom mappers and reducers in the data stream by using
features natively supported in the Hive language. e.g., in order to run a custom mapper script
- map_script - and a custom reducer script - reduce_script - the user can issue the following
command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.
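A sketch of such a statement, following Hive's MAP ... USING / REDUCE ... USING (TRANSFORM) syntax, with illustrative table, column, and script names:

FROM (

FROM page_views

MAP page_views.userid, page_views.dt

USING 'map_script'

AS dt, uid

CLUSTER BY dt) map_output

INSERT OVERWRITE TABLE page_views_reduced

REDUCE map_output.dt, map_output.uid

USING 'reduce_script'

AS dt, uid_count;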

By default, columns will be transformed to STRING and delimited by TAB before feeding to
the user script; similarly, all NULL values will be converted to the literal string \N in order to
differentiate NULL values from empty strings.

The standard output of the user script will be treated as TAB-separated STRING columns,
any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING
column will be cast to the data type specified in the table declaration in the usual way. User
scripts can output debug information to standard error which will be shown on the task detail
page on hadoop. These defaults can be overridden with ROW FORMAT.

3.4.5 Joins

JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the database.

Syntax

join_table:

table_reference JOIN table_factor [join_condition]

| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference

join_condition

| table_reference LEFT SEMI JOIN table_reference join_condition

| table_reference CROSS JOIN table_reference [join_condition]

There are different types of joins given as follows:

 Join

 Left outer join

 Right outer join

 Full outer join

JOIN clause is used to combine and retrieve the records from multiple tables. A plain JOIN behaves like an inner join in SQL, while the OUTER JOIN variants also keep unmatched rows. A JOIN condition is raised using the primary keys and foreign keys of the tables.
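A short sketch of an inner join and a left outer join, using hypothetical CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

hive> SELECT c.ID, c.NAME, o.AMOUNT FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);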

3.4.6 Sub queries

A Query present within a Query is known as a sub query. The main query will depend on the
values returned by the sub queries.

Sub queries can be classified into two types

 Sub queries in FROM clause

 Sub queries in WHERE clause

When to use

 To get a particular value combined from two column values from different tables

 Dependency of one table values on other tables

 Comparative checking of one column values from other tables

Syntax

Sub query in FROM clause

SELECT <column names 1, 2…n> FROM (Sub Query) <TableName_Main>;

Sub query in WHERE clause

SELECT <column names 1, 2…n> FROM <TableName_Main> WHERE col1 IN (Sub Query);

Types

Below are types of sub queries that are supported

 Table Sub query - You can write the sub query in place of table name.

 Sub query in WHERE clause - These types of sub queries are widely used in HiveQL
queries and statements. You can either return the single value or multiple values from
the query from WHERE clause. If you are returning single values, use equality operator
otherwise IN operator.

 Correlated Sub query - Correlated sub queries are queries in which sub query refers to
the column from parent table clause.
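Two short sketches, one for each clause, reusing the employee_data table (the dept column and the departments table are illustrative):

hive> SELECT t.dept, t.avg_sal FROM (SELECT dept, avg(Salary) AS avg_sal FROM employee_data GROUP BY dept) t;

hive> SELECT Name FROM employee_data WHERE dept IN (SELECT dept FROM departments WHERE location = 'NY');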

3.5 CONCEPTS OF HBASE

HBase is a column-oriented non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing or random read/write access to large volumes of data.

Unlike relational database systems, HBase does not support a structured query language like
SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in
Java™ much like a typical Apache Map Reduce application. HBase does support writing
applications in Apache Avro, REST and Thrift.

3.5.1 Advanced Usage

An HBase system is designed to scale linearly. It comprises a set of standard tables with rows
and columns, much like a traditional database. Each table must have an element defined as a
primary key, and all access attempts to HBase tables must use this primary key.

HBase has two fundamental key structures: the row key and the column key. Both can be
used to convey meaning, by either the data they store, or by exploiting their sorting order.
HBase’s main unit of separation within a table is the column family—not the actual columns as expected from a column-oriented database in the traditional sense. Although you store cells in a table format logically, in reality these rows are stored as linear sets of the actual cells, which in turn contain all the vital information inside them.

Logically, the data is laid out in rows and columns. The columns are the typical HBase combination of a column family name and a column qualifier, forming the column key. The rows also have a row key so that you can address all columns in one logical row.

Avro, as a component, supports a rich set of primitive data types including numeric, binary data and strings; and a number of complex types including arrays, maps, enumerations and records. A sort order can also be defined for the data. HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into HBase, but if you’re running a production cluster, it’s suggested that you have a dedicated ZooKeeper cluster that’s integrated with your HBase cluster. HBase works well with Hive, a query engine for batch processing of big data, to enable fault-tolerant big data applications.

Example of HBase

An HBase column represents an attribute of an object; if the table is storing diagnostic logs
from servers in your environment, each row might be a log record, and a typical column
could be the timestamp of when the log record was written, or the server’s name where the
record originated.

HBase allows for many attributes to be grouped together into column families, such that the
elements of a column family are all stored together. This is different from a row-oriented
relational database, where all the columns of a given row are stored together. With HBase
you must predefine the table schema and specify the column families. However, new
columns can be added to families at any time, making the schema flexible and able to adapt
to changing application requirements.

Just as HDFS has a Name Node and slave nodes, and Map Reduce has Job Tracker and Task Tracker slaves, HBase is built on similar concepts. In HBase, a master node manages the cluster and region servers store portions of the tables and perform the work on the data. In the same way that HDFS has some enterprise concerns due to the availability of the Name Node, HBase is also sensitive to the loss of its master node.

Features

 Linear and modular scalability.

 Strictly consistent reads and writes.

 Automatic and configurable sharding of tables

 Automatic failover support between Region Servers.

 Convenient base classes for backing Hadoop Map Reduce jobs with Apache HBase
tables.

 Easy to use Java API for client access.

 Block cache and Bloom Filters for real-time queries.

 Query predicate push down via server-side Filters

 Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options

 Extensible jruby-based (JIRB) shell

 Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or
via JMX

3.5.2 Schema Design

HBase’s data model is very different from what you have likely worked with or know of in
relational databases. As described in the original Bigtable paper, it’s a sparse, distributed,
persistent multidimensional sorted map, which is indexed by a row key, column key, and a
timestamp. You’ll hear people refer to it as a key-value store, a column-family-oriented
database, and sometimes a database storing versioned maps of maps. All these descriptions
are correct. This section touches upon these various concepts. The easiest and most naive way
to describe HBase’s data model is in the form of tables, consisting of rows and columns. This
is likely what you are familiar with in relational databases.

But that’s where the similarity between RDBMS data models and HBase ends. The HBase
schema design is very different compared to the relation database schema design. Below are
some of general concept that should be followed while designing schema in HBase:

 Row key: Each table in HBase table is indexed on row key. Data is sorted
lexicographically by this row key. There are no secondary indices available on HBase
table.

 Atomicity: Avoid designing tables that require atomicity across all rows. All operations on HBase rows are atomic at the row level.

 Even distribution: Reads and writes should be uniformly distributed across all nodes available in the cluster. Design the row key in such a way that related entities are stored in adjacent rows to increase read efficiency.

Consider a table in HBase consisting of two column families, Personal and Office, each having two columns. The entity that contains the data is called a cell. The rows are sorted based on the row keys. Such a table would look like the one shown in the figure below.

Fig 3.26 Table in HBase
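
As a hedged illustration, a table like the one in the figure could be created and populated from the HBase shell; the table name 'employee' and the cell values below are assumptions for the example, not values taken from the figure:

hbase(main):001:0> create 'employee', 'Personal', 'Office'         # table with two column families
hbase(main):002:0> put 'employee', 'row1', 'Personal:name', 'John'
hbase(main):003:0> put 'employee', 'row1', 'Personal:city', 'Pune'
hbase(main):004:0> put 'employee', 'row1', 'Office:designation', 'Manager'
hbase(main):005:0> put 'employee', 'row1', 'Office:salary', '50000'
hbase(main):006:0> get 'employee', 'row1'                          # returns every cell of the row

Note that only the column families (Personal, Office) are fixed at table-creation time; the individual column qualifiers such as name or salary are supplied freely with each put, which is exactly the schema flexibility described above.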

3.5.3 Advance Indexing

Advanced indexing, also known as secondary indexing, provides an orthogonal way to access data beyond its primary access path. In HBase, you have a single index that is lexicographically
sorted on the primary row key. Access to records in any way other than through the primary
row requires scanning over potentially all the rows in the table to test them against your filter.
With secondary indexing, the columns, or expressions you index form an alternate row key to
allow point lookups and range scans along this new axis.

Beyond that single row-key index, HBase has no other indexes. The row key, column family, and column qualifier are all stored in sort order based on the Java comparator for byte arrays (everything is a byte array in HBase). There is a bit of elegance in this simplicity. The downside is that the only access pattern is based on the row key, so if you have multiple use cases whose access patterns are orthogonal to each other, the second use case means a full table scan.

If you create a second table, and invert the data from (row key, value) to a (value, row key),
you have what is known as an inverted table. That is to say, if I want to create an index on an
attribute in a record, I could just create a second table where the attribute is the row key and
then create a column entry for each row key of the base table.

As an example, if I was creating a table to manage my employees and their information, I


could create a secondary index on the manager's id. (In this case it’s a name). So, for each
record in the base table, there would be a corresponding entry in the index table.
With secondary indexing, I can either find a single row or find all of the rows that contain an attribute with a specific value. Like everything in HBase, the index is stored in sort order, so it becomes rather trivial for the client to fetch several rows and join them in sort order, or to take the intersection if we are trying to find the records that meet a specific qualification (e.g., find all of Bob's employees who live in Cleveland, OH, or find the average cost of repairing a Volvo S80 that was involved in a front-end collision).
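
A minimal sketch of this inverted-table pattern in the HBase shell, assuming a hypothetical 'manager_index' table whose row key is the manager's name and whose column qualifiers are the row keys of the base employee table (the application itself is responsible for keeping the two tables in sync):

hbase(main):001:0> create 'manager_index', 'emp'
hbase(main):002:0> # one entry per base-table row key that has this manager
hbase(main):003:0> put 'manager_index', 'Bob', 'emp:row1', ''
hbase(main):004:0> put 'manager_index', 'Bob', 'emp:row7', ''
hbase(main):005:0> get 'manager_index', 'Bob'    # returns row1 and row7, the rows to fetch from the base table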

As more people want to use HBase like a database and apply SQL, using secondary indexing
makes filtering and doing data joins much more efficient. One just takes the intersection of
the indexed qualifiers specified, and then applies the unindexed qualifiers as filters further
reducing the result set.

3.6 PIG

Apache Pig is a platform for analysing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turns enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences
of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g.,
the Hadoop subproject). Pig's language layer currently consists of a textual language called
Pig Latin, which has the following key properties:

 Ease of programming - It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

 Optimization opportunities - The way in which tasks are encoded permits the system
to optimize their execution automatically, allowing the user to focus on semantics rather
than efficiency.

 Extensibility - Users can create their own functions to do special-purpose processing.

Apache Pig is an abstraction over Map Reduce. It is a tool/platform used to analyse large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators using which programmers can develop their own
functions for reading, writing, and processing data.

To analyse data using Apache Pig, programmers need to write scripts using Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has
a component known as Pig Engine that accepts the Pig Latin scripts as input and converts
those scripts into Map Reduce jobs.
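
As a hedged illustration of what such a Pig Latin script looks like in the Grunt shell, the file path, delimiter, and field names below are assumptions for the example:

grunt> logs = LOAD '/user/hadoop/server_logs.txt' USING PigStorage(',') AS (server:chararray, level:chararray, msg:chararray);
grunt> errors = FILTER logs BY level == 'ERROR';
grunt> grouped = GROUP errors BY server;
grunt> counts = FOREACH grouped GENERATE group AS server, COUNT(errors) AS error_count;
grunt> DUMP counts;

When DUMP (or STORE) is reached, the Pig Engine compiles these statements into one or more Map Reduce jobs and runs them on the cluster.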

Features of Pig

Apache Pig comes with the following features −

 Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.

 Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if
you are good at SQL.

 Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language.

 Extensibility − Using the existing operators, users can develop their own functions to
read, process, and write data.

 UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts.

 Handles all kinds of data − Apache Pig analyses all kinds of data, both structured as
well as unstructured. It stores the results in HDFS.

3.7 ZOOKEEPER

Apache Zookeeper is a coordination service for distributed applications that enables synchronization across a cluster. Zookeeper in Hadoop can be viewed as a centralized repository where distributed applications can put data and get data out of it. It is used to keep the distributed system functioning together as a single unit, using its synchronization, serialization, and coordination goals. For simplicity's sake, Zookeeper can be thought of as a file system where znodes store data instead of files or directories. Zookeeper is also a Hadoop admin tool used for managing jobs in the cluster.

Formally, Apache Zookeeper is a distributed, open-source configuration and synchronization service along with a naming registry for distributed applications. It is used to manage and coordinate large clusters of machines. For example, Apache Storm, which is used by Twitter for storing machine state data, has Apache Zookeeper as the coordinator between machines.

Why Do We Need Zookeeper in Hadoop?

Distributed applications are difficult to coordinate and work with, as they are much more error prone due to the huge number of machines attached to the network. With so many machines involved, race conditions and deadlocks are common problems when implementing distributed applications. A race condition occurs when a machine tries to perform two or more operations at a time, and this is taken care of by Zookeeper's serialization property. A deadlock occurs when two or more machines try to access the same shared resource at the same time; more precisely, each tries to access a resource held by the other, which locks up the system because neither releases its own resource while waiting for the other to release its first. Synchronization in Zookeeper helps to resolve deadlocks. Another major issue with distributed applications is partial failure of a process, which can lead to inconsistency of data. Zookeeper handles this through atomicity, which means either the whole of the process will finish or nothing will persist after failure. Thus, Zookeeper is an important part of Hadoop that takes care of these small but important issues so that developers can focus more on the functionality of the application.

Fig 3.27 Zookeeper in Hadoop

How Does Zoo Keeper in Hadoop Work?

Hadoop Zoo Keeper is a distributed application that follows a simple client-server model, where clients are nodes that make use of the service and servers are nodes that provide the service. Multiple server nodes are collectively called a Zoo Keeper ensemble. At any given time, each Zoo Keeper client is connected to one Zoo Keeper server. A master node is dynamically chosen by consensus within the ensemble; a Zookeeper ensemble therefore usually has an odd number of servers so that a majority vote is possible. If the master node fails, another master is chosen in no time and takes over from the previous master. Other than the master and slaves, there are also observers in Zookeeper. Observers were brought in to address the issue of scaling: adding more slaves hurts write performance because the voting process is expensive. Observers are therefore slaves that do not take part in the voting process but otherwise have the same duties as the other slaves.
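
A minimal sketch of the zoo.cfg configuration for a three-server ensemble; the host names and data directory are illustrative assumptions, while the property names are standard Zoo Keeper settings:

# zoo.cfg (illustrative values)
tickTime=2000        # basic time unit in milliseconds
initLimit=10         # ticks a follower may take to connect and sync to the leader
syncLimit=5          # ticks a follower may lag behind the leader before being dropped
dataDir=/var/lib/zookeeper
clientPort=2181
# one line per server: server.<id>=<host>:<peer-port>:<leader-election-port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

Each server also needs a myid file in dataDir containing its own id (1, 2, or 3 here); the standalone example later in this section shows the warning Zoo Keeper prints when that file is missing.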

Writes in Zookeeper

All the writes in Zookeeper go through the master node, so it is guaranteed that all writes will be sequential. When a write operation is performed on Zookeeper, each server attached to that client persists the data along with the master. This keeps all the servers updated about the data. However, it also means that concurrent writes cannot be made. The linear-writes guarantee can be problematic if Zookeeper is used for a write-dominant workload. Zookeeper in Hadoop is ideally used for coordinating message exchanges between clients, which involves fewer writes and more reads. Zookeeper is helpful as long as the data is shared, but if the application has concurrent data writing, Zookeeper can get in the way of the application and impose strict ordering of operations.

Reads in Zookeeper

Zookeeper is best at reads, as reads can be concurrent. Concurrent reads are possible because each client is attached to a different server and all clients can read from the servers simultaneously, although concurrent reads lead to eventual consistency, as the master is not involved. There can be cases where a client has an outdated view, which gets updated with a little delay.

3.7.1 How it helps in monitoring a cluster?

Apache Zookeeper is an open-source server that reliably coordinates distributed processes and applications. It allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system.

Apache Zookeeper provides a hierarchical file system (with znodes as the system files) that helps with the discovery, registration, configuration, locking, leader selection, queuing, etc. of services working on different machines. The Zoo Keeper server maintains configuration information and naming, and provides distributed synchronization and group services used by distributed applications.
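
Independently of any monitoring product, a quick health check can be made with Zoo Keeper's built-in four-letter-word commands over the client port; a minimal sketch, assuming a server on localhost:2181 (recent Zoo Keeper releases may require these commands to be whitelisted in the configuration):

$ echo ruok | nc localhost 2181     # the server answers "imok" if it is running and healthy
imok
$ echo stat | nc localhost 2181     # prints version, latency, connection counts, and whether this node is leader or follower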

Applications Manager aims to help administrators manage their Zookeeper server: collect all the metrics that can help with troubleshooting, display performance graphs, and be alerted automatically of potential issues. Let’s take a look at what you need to see to monitor Zookeeper and the performance metrics to gather with Applications Manager:

 Resource utilization details - Automatically discover Zookeeper clusters, monitor memory (heap and non-heap) on the znodes, and get alerts of changes in resource consumption.

 Thread and JVM usage - Track thread usage with metrics like Daemon, Peak and Live
Thread Count. Ensure that started threads don’t overload the server's memory.

 Performance Statistics - Gauge the amount of time it takes for the server to respond to a
client request, queued requests and connections in the server and performance
degradation due to network usage (client packets sent and received).

 Cluster and Configuration details - Track the number of Znodes, the watcher setup over
the nodes and the number of followers within the ensemble. Keep an eye on the leader
selection stats and client session times.

 Fix Performance Problems Faster - Get instant notifications when there are performance
issues with the components of Apache Zookeeper. Become aware of performance
bottlenecks and take quick remedial actions before your end users experience issues.

To create an Apache Zookeeper Monitor, follow the steps given below:

 Click on New Monitor link. Choose Apache Zookeeper.

 Enter Display Name of the monitor.

 Enter the IP Address or hostname of the host where zookeeper server runs.

 Enter the JMX Port of the Zookeeper server. By default, it will be 7199. Or Check in
zkServer.sh file for the JMX_PORT.

 To discover only this node and not all nodes in the cluster, disable the option Discover all nodes in the Cluster. It is enabled by default, which means all the nodes in the cluster are discovered.

 Enter the credential details like username, password and JNDIPath or select credentials
from a Credential Manager list.

 Check the Is Authentication Required field to provide the JMX credentials to be used to connect to the Zookeeper server.

 Enter the polling interval time in minutes.

 Click Test Credentials button if you want to test the access to Apache Zookeeper Server.

 Choose the Monitor Group from the combo box with which you want to associate
Apache Zookeeper Monitor (optional). You can choose multiple groups to associate your
monitor.

 Click Add Monitor(s). This discovers Apache Zookeeper from the network and starts
monitoring.

3.7.2 Uses of HBase in Zookeeper

HBase is a NoSQL data store that runs on top of your existing Hadoop cluster (HDFS). It provides capabilities like random, real-time reads/writes, which HDFS, being a file system, lacks. Since it is a NoSQL data store, it does not follow SQL conventions and terminologies. HBase provides a good set of APIs (including Java and Thrift). Along with this, HBase also provides seamless integration with the Map Reduce framework. But along with all these advantages of HBase, keep in mind that random reads and writes are quick but always carry additional overhead, so think well before you make any decision.

Zoo Keeper is a high-performance coordination service for distributed applications (like HBase). It exposes common services like naming, configuration management, synchronization, and group services in a simple interface so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own specific needs.

HBase relies completely on Zookeeper. HBase provides the option to use its built-in Zookeeper, which is started whenever you start HBase, but this is not a good choice if you are working on a production cluster. In such scenarios it is always better to have a dedicated Zookeeper cluster and integrate it with your HBase cluster. In Apache HBase, Zoo Keeper coordinates, communicates, and shares state between the Masters and Region Servers. HBase has a design policy of using Zoo Keeper only for transient data (that is, for coordination and state communication). Thus, if HBase's Zoo Keeper data is removed, only the transient operations are affected; data can continue to be written to and read from HBase.
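
A minimal sketch of pointing HBase at such a dedicated ensemble; the host names are assumptions, while the property names are standard HBase configuration keys:

<!-- hbase-site.xml (illustrative host names) -->
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>

# hbase-env.sh: tell HBase not to start its own built-in Zookeeper
export HBASE_MANAGES_ZK=false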

In short, Zookeeper is used to maintain configuration information and communication between region servers and clients. It also provides distributed synchronization. It exposes common services like naming, configuration management, and group services in a simple interface so you don't have to write them from scratch.

3.7.3 How to Build Applications with Zookeeper?

Zoo Keeper runs on a cluster of servers called an ensemble that shares the state of your data.
(These may be the same machines that are running other Hadoop services or a separate
cluster.) Whenever a change is made, it is not considered successful until it has been written
to a quorum (a majority) of the servers in the ensemble.

A leader is elected within the ensemble, and if two conflicting changes are made at the same
time, the one that is processed by the leader first will succeed and the other will fail. Zoo
Keeper guarantees that writes from the same client will be processed in the order they were
sent by that client. This guarantee, along with other features discussed below, allows the system to be used to implement locks, queues, and other important primitives for distributed coordination. The outcome of a write operation allows a node to be certain that an identical write
has not succeeded for any other node.

It is best to run your Zoo Keeper ensemble with an odd number of servers; typical ensemble
sizes are three, five, or seven. For instance, if you run five servers and three are down, the
cluster will be unavailable (so you can have one server down for maintenance and still
survive an unexpected failure). If you run six servers, however, the cluster is still unavailable
after three failures but the chance of three simultaneous failures is now slightly higher. Also
remember that as you add more servers, you may be able to tolerate more failures, but you
also may begin to have lower write throughput.

You need to have Java installed before running Zoo Keeper.

You’ll then need to install the correct CDH4 package repository for your system and install
the zookeeper package (required for any machine connecting to Zoo Keeper) and the
zookeeper-server package (required for any machine in the Zoo Keeper ensemble).

The warning you will see indicates that, the first time Zoo Keeper is run on a given host, it needs to initialize some storage space. You can do that as shown below and start a Zoo Keeper server running in a single-node/standalone configuration.

$ sudo service zookeeper-server init

No myid provided, be sure to specify it in /var/lib/zookeeper/myid if using non-standalone

$ sudo service zookeeper-server start

JMX enabled by default

Using config: /etc/zookeeper/conf/zoo.cfg

Starting zookeeper ... STARTED

Creating a znode is as easy as specifying the path and the contents. Create an empty znode to
serve as a parent ‘directory’, and another znode as its child:

[zk: localhost:2181(CONNECTED) 2] create /zk-demo ''

Created /zk-demo

[zk: localhost:2181(CONNECTED) 3] create /zk-demo/my-node 'Hello!'

Created /zk-demo/my-node

You can then read the contents of these znodes with the get command. The data contained in
the znode is printed on the first line, and metadata is listed afterwards.

Zoo Keeper can also notify you of changes in a znode’s content or changes in a znode’s
children. To register a “watch” on a znode’s data, you need to use the get or stat commands to
access the current content or metadata and pass an additional parameter requesting the watch.
To register a “watch” on a znode’s children, you pass the same parameter when getting the
children with ls.

[zk: localhost:2181(CONNECTED) 1] create /zk-demo/watch-this data

Created /zk-demo/watch-this

[zk: localhost:2181(CONNECTED) 2] get /zk-demo/watch-this true

data

<metadata>

Modify the same znode (either from the current Zoo Keeper client or a separate one), and you
will see the following message written to the terminal:

WATCHER

WatchedEvent state:SyncConnected type:NodeDataChanged path:/watch-this

Note that watches fire only once. If you want to be notified of changes in the future, you must reset the watch each time it fires. Watches allow you to use Zookeeper to implement asynchronous, event-based systems and to notify nodes when their local copy of the data in Zoo Keeper is stale.

3.8 DISTINGUISH BETWEEN HDFS AND HBASE

HDFS is fault-tolerant by design and supports rapid data transfer between nodes even during
system failures. HBase is a non-relational and open source Not-Only-SQL database that runs
on top of Hadoop. HBase falls under the CP category of the CAP (Consistency, Availability, and Partition Tolerance) theorem.

HDFS is most suitable for performing batch analytics. However, one of its biggest drawbacks
is its inability to perform real-time analysis, the trending requirement of the IT industry.
HBase, on the other hand, can handle large data sets and is not appropriate for batch
analytics. Instead, it is used to write/read data from Hadoop in real-time.

Both HDFS and HBase are capable of processing structured, semi-structured, and unstructured data. HDFS lacks an in-memory processing engine, which slows down data analysis, since it relies on plain old Map Reduce. HBase, on the contrary, boasts an in-memory processing engine that drastically increases the speed of reads and writes.

HDFS is very transparent in its execution of data analysis. HBase, on the other hand, being a
NoSQL database in tabular format, fetches values by sorting them under different key values.

Use Case 1 – Cloudera Optimization for European Bank using HBase

HBase is ideally suited for real-time environments, and this can best be demonstrated by citing the example of our client, a renowned European bank. To derive critical insights from the logs of application/web servers, we implemented a solution using Apache Storm and Apache HBase together. Given the huge velocity of data, we opted for HBase over HDFS, as HDFS does not support real-time writes. The results were overwhelming; the query time was reduced from 3 days to 3 minutes.

Use Case 2 – Analytics Solution for Global CPG Player using HDFS & Map Reduce

With our global beverage player client, the primary objective was to perform batch analytics to gain SKU-level insights, which involved recursive/sequential calculations. HDFS and Map
Reduce frameworks were better suited than complex Hive queries on top of HBase. Map
Reduce was used for data wrangling and to prepare data for subsequent analytics. Hive was
used for custom analytics on top of data processed by Map Reduce. The results were
impressive as there was a drastic reduction in the time taken to generate custom analytics – 3
days to 3 hours.

HDFS: HDFS is a Java-based file system utilized for storing large data sets.
HBase: HBase is a Java-based Not Only SQL database.

HDFS: HDFS has a rigid architecture that does not allow changes. It doesn’t facilitate dynamic storage.
HBase: HBase allows for dynamic changes and can be utilized for standalone applications.

HDFS: HDFS is ideally suited for write-once and read-many-times use cases.
HBase: HBase is ideally suited for random write and read of data that is stored in HDFS.

Table 3.2 Difference between HDFS and HBase

3.9 SUMMARY

 Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.

 The normal file system was designed to work on a single machine or single operating
environment.

 The smallest amount of data that can be read from or written to a disk at a time is called the block size.

 Getting Hadoop to work on the entire cluster involves getting the required software on
all the machines that are tied to the cluster. As per the norms one of the machines is
associated with the Name Node and another is associated with the Resource Manager.

 HDFS allows user data to be organized in the form of files and directories. It provides
a command line interface called FS shell that lets a user interact with the data in
HDFS.

 Java is an object-oriented programming language that runs on almost all electronic devices. Java is platform-independent because of Java virtual machines (JVMs).

 APIs are important software components bundled with the JDK. APIs in Java include classes, interfaces, and user interfaces.

 Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale.

 The Hive Query Language (HiveQL) is a query language for Hive to process and
analyse structured data in a Metastore.

 A query is a question or a request for information expressed in a formal manner. In computer science, a query is essentially the same thing; the only difference is that the answer or retrieved information comes from a database.

 In Hive, the aggregate function returns a single value resulting from computation over
many rows.

3.10 KEYWORDS

 JDBC driver - A JDBC driver is a software component that enables a Java application to interact with a database. The JDBC-ODBC bridge driver uses an ODBC driver to connect to the database; it converts JDBC method calls into ODBC function calls. This approach is now discouraged in favour of the thin driver. It is easy to use and can be easily connected to any database.

 Datasets – A data set (or dataset) is a collection of data. In the case of tabular data, a
data set corresponds to one or more database tables, where every column of a table
represents a particular variable, and each row corresponds to a given record of the
data set in question. The data set lists values for each of the variables, such as height
and weight of an object, for each member of the data set. Each value is known as a
datum. Data sets can also consist of a collection of documents or files.

 Name nodes - The Name Node works as the master in a Hadoop cluster. It stores metadata about the actual data. It also manages the file system namespace and regulates client access requests for the actual data files. It assigns work to the slaves, i.e., the Data Nodes, and executes file system namespace operations like opening/closing files and renaming files and directories.

As the Name Node keeps metadata in memory for fast retrieval, a huge amount of memory is required for its operation. It should be hosted on reliable hardware.

 Shell Commands – The shell is the command interpreter on Linux systems. It is the program that interacts with the users in the terminal emulation window. Shell commands are instructions that tell the system to perform some action. A shell is a computer program that presents a command line interface, which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination.

 UNIX Domain - A Unix domain socket or IPC socket (inter-process communication socket) is a data communications endpoint for exchanging data between processes executing on the same host operating system. The Unix domain socket facility is a standard component of POSIX operating systems. The API for Unix domain sockets is similar to that of an Internet socket, but rather than using an underlying network protocol, all communication occurs entirely within the operating system kernel. Unix domain sockets may use the file system as their address name space.

3.11 LEARNING ACTIVITY

1. Carry out research on how Hive is different from HiveQL.

___________________________________________________________________________
___________________________________________________________________________

2. Collect certain facts and build an application with Zookeeper

___________________________________________________________________________
___________________________________________________________________________

3.12 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. What is HDFS? Give suitable examples.

2. Describe the five main advantages to using HDFS.

3. What are three possible use cases of Hadoop Shell commands?

4. What is Java API? Explain briefly.

5. Explain the role of HIVE?

Long Questions

1. Explain the two types of nodes operating in the cluster in a master-slave pattern.

2. Mention the steps to install and configure the Hadoop on a machine.

3. Explain the Pros and Cons of UI against direct calling hdfs dfs function.

4. Enumerate the four types of APIs in Java.

5. Mention the features of HIVEQL.

B. Multiple Choice Questions

1. What is full form of HDFS?

a. Hadoop Distributed File System

b. Hadoop Field System

c. Hadoop File Search

d. Hadoop Field search

2. Which of the following are the Goals of HDFS?

a. Fault detection and recovery

b. Huge datasets

c. Hardware at data

d. All of these

3. What is Hive?

a. An open-source data warehouse system

b. relational database

c. OLTP

d. A language

4. Which of the following is correct statement?

a. HBase is a distributed column-oriented database

b. HBase is not open source

c. HBase is horizontally scalable.

d. Both A and C

5. Which of the following is incorrect statement?

a. Zoo Keeper is a distributed co-ordination service to manage large set of hosts.

b. Zoo Keeper allows developers to focus on core application logic without worrying about the distributed nature of the application.

c. Zoo Keeper solves this issue with its simple architecture and API.

d. The Zoo Keeper framework was originally built at "Google" for accessing
their applications in an easy and robust manner

Answers

1-a, 2-d, 3-a, 4-d, 5-d

3.13 REFERENCES

Reference Books

 Jiwat Ram, C. Z. (2016). The Implications of Big Data Analytics on Business Intelligence: A Qualitative Study in China, Procedia Computer Science, Volume 87, 221-226.

 Shabbir, M.Q., Gardezi, S.B.W. Application of Big Data analytics and organizational
performance: the mediating role of knowledge management practices. J Big Data 7,
47 (2020).

 García-Gil, D., Ramírez-Gallego, S., García, S., & Herrera, F. (2017). A comparison
on scalability for batch big data processing on Apache Spark and Apache Flink. Big
Data Analytics, 2(1). https://doi.org/10.1186/s41044-016-0020-2

Textbooks

 Sagiroglu S, Sinanc D. Big Data: a review. In: 2013 international conference on collaboration technologies and systems (CTS). IEEE. 2013.

 Touil, M. (2019). Big Data: Spark Hadoop and Their databases. Independently
published.

 P. (2021). ORACLE DATA BASE. Advanced Administration with SQL. Scientific Books.

Websites

 https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster

 https://github.com/avast/hdfs-shell#advantages-ui-against-direct-calling-hdfs-dfs-
function

 https://www.javatpoint.com/api-full-form

