Unit 3
STRUCTURE
3.1 Introduction
3.2.1 Installation
3.2.2 Shell
3.3 Hive
3.3.1 Architecture
3.3.2 Installation
3.3.3 Comparison with Traditional Database
3.4 HiveQL
3.4.1 Querying Data
3.4.2 Sorting
3.4.3 Aggregating
3.4.5 Joins
3.4.6 Sub queries
3.5 Concepts of HBase
3.6 Pig
3.7 Zookeeper
3.7.1 How it helps in monitoring a cluster?
3.7.2 Uses of HBase in Zookeeper
3.8 Distinguish between HDFS and HBase
3.9 Summary
3.10 Keywords
3.13 References
3.1 INTRODUCTION
Thanks to big data, Hadoop has become a familiar term and has found its prominence in
today's digital world. When anyone can generate massive amounts of data with just one click,
the Hadoop framework is vital. Hadoop is an Apache open-source framework written in Java
that allows distributed processing of large datasets across clusters of computers using simple
programming models. A Hadoop application works in an environment that provides
distributed storage and computation across clusters of computers.
Hadoop is designed to scale up from single server to thousands of machines, each offering
local computation and storage.
Back in the day, data generation was limited, so storing and processing data could be done
with a single storage unit and a single processor. In the blink of an eye, data generation
increased by leaps and bounds, not only in volume but also in variety. A single processor was
therefore incapable of processing high volumes of such varied data; speaking of variety, data
can be structured, semi-structured, or unstructured. Multiple machines help process data in
parallel, but the storage unit became a bottleneck, resulting in network overhead. To address
this issue, storage is distributed amongst the processors, which allows data to be stored and
accessed efficiently with no network overhead. This setup is how data engineers and analysts
manage big data effectively.
The core technique of storing files lies in the file system that the operating environment uses.
Unlike common file systems, Hadoop uses a file system that deals with large datasets across a
distributed network: the Hadoop Distributed File System (HDFS). This unit introduces the
idea, with related background information to begin with.
Hadoop Distributed File System – HDFS is the world's most reliable storage system. HDFS is
the file system of Hadoop, designed for storing very large files on a cluster of commodity
hardware. It is designed on the principle of storing a small number of large files rather than a
huge number of small files.
Hadoop HDFS provides a fault-tolerant storage layer for Hadoop and its other components.
Replication of data in HDFS helps us attain this feature. It stores data reliably, even in the
case of hardware failure, and it provides high-throughput access to application data by
serving data access in parallel.
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage using SQL. Structure can be projected onto data
already in storage. A command line tool and JDBC driver are provided to connect users to
Hive.
This workbook contains some practical exercises for researchers and/or data analysts who
want to run simple queries using Apache Hive. The queries in this document are the ones
which were used as part of the 'What is Hive?' webinar. A few of the simpler queries, which
were repeated for different tables, have been omitted for brevity.
The normal file system was designed to work on a single machine or single operating
environment. The datasets in Hadoop require storage capacity beyond what a single physical
machine can provide, so it becomes imperative to partition data across a number of machines.
This requires a special process to manage the files across the distributed network, and HDFS
is the file system that specifically addresses this issue. This file system is more complex than
a regular file system because it has to deal with network programming, fragmentation, fault
tolerance, compatibility with the local file system, and so forth. It empowers Hadoop to run
Big Data applications across multiple servers and is characterized by high fault tolerance and
high data throughput on low-cost hardware. The objectives of the HDFS file system are as
follows:
Streaming data access to the file system must leverage a write once, read many times
pattern.
The smallest amount of data that is read from or written to a disk is called the block size.
Typically, a disk block is 512 bytes and file system blocks are a few kilobytes. HDFS works
on the same principle, but its block size is much larger; the larger block size minimizes seek
overhead and therefore cost. These blocks are distributed throughout the cluster as blocks and
copies of blocks on different servers in the network, and individual files are replicated across
servers in the cluster.
There are two types of nodes operating in the cluster in a master-slave pattern. The master
node is called the name node and the worker nodes are called data nodes. It is through these
nodes that HDFS maintains the file (and directory) system tree and metadata. A file is split
into blocks and stored in a subset of data nodes spread across the cluster. The data nodes are
responsible for read, write, block creation, deletion, and replication requests in the file
system.
The name node, on the other hand, is the server that monitors access to the file system and
maintains data files in HDFS. It maps blocks to data nodes and handles file/directory open,
close, and rename requests.
Data nodes are the core part of the file system and do the job of storing and retrieving the
blocks requested by the client. The name node is the maintainer to whom data nodes report.
This means that if the name node is obliterated, the information about the files would be lost.
Therefore, Hadoop makes sure that the name node is resilient enough to withstand any kind
of failure. One technique to ensure that is to back it up in a secondary name node by
periodically merging the namespace image with the edit log. The secondary name node
usually resides on a separate machine to take over as the primary name node in case of a
major failure.
There are many ways to interact with the HDFS file system, but the command line interface
is perhaps the simplest and most common. Hadoop can be installed onto one machine and run
to get a first-hand taste of it.
The HDFS file system operations are quite similar to the normal file system operations. Here
are some listings just to give an idea.
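The listings themselves are not reproduced in this copy; a few representative commands, with the /user/hadoop path as an assumption, give the flavour:
$ hadoop fs -ls /                                  # list the root of HDFS
$ hadoop fs -mkdir /user/hadoop/input              # create a directory
$ hadoop fs -put localfile.txt /user/hadoop/input  # copy a local file into HDFS
$ hadoop fs -cat /user/hadoop/input/localfile.txt  # print a file stored in HDFS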
Cost Effectiveness
The Data Nodes that store the data rely on inexpensive off-the-shelf hardware, which cuts
storage costs. Also, because HDFS is open source, there's no licensing fee.
Large Data Set Storage
HDFS stores a variety of data of any size from megabytes to petabytes and in any format,
including structured and unstructured data.
Portability
HDFS is portable across all hardware platforms, and it is compatible with several operating
systems, including Windows, Linux, and Mac OS/X.
HDFS is built for high data throughput, which is best for access to streaming data.
3.2.1 Installation
Here you will learn how to successfully install Hadoop and configure clusters, which can
range from just a couple of nodes to tens of thousands of nodes in huge clusters. For that, you
first need to install Hadoop on a single machine; the prerequisite is that Java is installed on
your system, if you don't have it already.
Getting Hadoop to work on the entire cluster involves getting the required software on all the
machines that are tied to the cluster. As per the norms, one of the machines is designated as
the Name Node and another as the Resource Manager.
Other services, like the Map Reduce Job History Server and the Web App Proxy Server, can
be hosted on specific machines or on shared resources, as per the requirement of the task or
load. All the other nodes in the entire cluster have the dual nature of being both Node
Manager and Data Node. These are collectively termed the slave nodes.
There are two ways to install Hadoop, i.e., Single node and Multi node.
A single node cluster means only one Data Node is running, with the Name Node, Data
Node, Resource Manager and Node Manager all set up on a single machine. This is used for
study and testing purposes. For example, let us consider a sample data set inside the
healthcare industry.
So, for testing whether the Oozie jobs have scheduled all the processes like collecting,
aggregating, storing, and processing the data in a proper sequence, we use a single node
cluster. It can easily and efficiently test the sequential workflow in a smaller environment, as
compared to large environments which contain terabytes of data distributed across hundreds
of machines.
In a multi node cluster, there is more than one Data Node running, and each Data Node runs
on a different machine. Multi node clusters are what organizations practically use for
analysing Big Data.
Considering the above example, in real time when we deal with petabytes of data, the data
needs to be distributed across hundreds of machines to be processed; thus, here we use a
multi node cluster.
Step 1
Download the Java 8 package and save the file in your home directory.
Step 2
Step 3
120
Step 4
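The command listings for Steps 2–4 are not reproduced in this copy; on a typical single-node setup they amount to extracting the Java archive and downloading and extracting the Hadoop 2.7.3 package, roughly as follows (the archive file names and mirror URL are assumptions):
Command: tar -xvf jdk-8u101-linux-x64.tar.gz
Command: wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Command: tar -xvf hadoop-2.7.3.tar.gz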
Step 5
Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
For applying all these changes to the current Terminal, execute the source command.
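The exact lines depend on where Java and Hadoop were extracted; a hedged sketch, with the home-directory paths as assumptions:
# Hadoop and Java environment variables (paths are assumptions)
export HADOOP_HOME=$HOME/hadoop-2.7.3
export JAVA_HOME=$HOME/jdk1.8.0_101
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
Command: source .bashrc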
Fig 3.5 Hadoop installation – refreshing environment variables…
To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and hadoop version commands.
Step 6
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
All the Hadoop configuration files are located in the hadoop-2.7.3/etc/hadoop directory, as
you can see in the snapshot below:
Fig 3.8 Hadoop installation – Hadoop configuration files…
Step 7
Open core-site.xml and edit the property mentioned below inside configuration tag:
core-site.xml informs Hadoop daemon where Name Node runs in the cluster. It contains
configuration settings of Hadoop core such as I/O settings that are common to HDFS & Map
Reduce.
Command: vi core-site.xml
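The property itself is not reproduced above; on a single-node setup it is typically the default file system URI (host and port are assumptions):
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>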
Step 8
Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:
hdfs-site.xml contains configuration settings of HDFS daemons (i.e., Name Node, Data
Node, and Secondary Name Node). It also includes the replication factor and block size of
HDFS.
Command: vi hdfs-site.xml
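The property itself is not reproduced above; on a single-node setup it typically at least sets the replication factor to 1:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>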
Step 9
Edit the mapred-site.xml file and edit the property mentioned below inside the configuration
tag:
In some cases, the mapred-site.xml file is not available, so we have to create it from the
mapred-site.xml.template file.
Command: vi mapred-site.xml
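The template copy and the property usually set are sketched below (the cp command reflects the template approach described above):
Command: cp mapred-site.xml.template mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>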
Step 10
Edit yarn-site.xml and edit the property mentioned below inside the configuration tag:
yarn-site.xml contains configuration settings of the Resource Manager and Node Manager,
such as application memory management size, the operations needed on programs and
algorithms, etc.
Command: vi yarn-site.xml
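The properties themselves are not reproduced above; a typical single-node configuration enables the shuffle auxiliary service:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>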
Fig 3.12 Hadoop installation – configuring yarn-site.xml…
Step 11
hadoop-env.sh contains the environment variables that are used in the scripts that run
Hadoop, such as the Java home path.
Command: vi hadoop-env.sh
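The line to set is not reproduced above; it is typically the JDK path (the exact path is an assumption):
# point JAVA_HOME at the JDK installation (path is an assumption)
export JAVA_HOME=/home/<user>/jdk1.8.0_101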
Step 12
Command: cd
Command: cd hadoop-2.7.3
This formats the HDFS via the Name Node. This command is only executed the first time.
Formatting the file system means initializing the directory specified by the dfs.name.dir
variable.
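The format command itself is not shown in this copy; for this Hadoop version it is typically:
Command: bin/hadoop namenode -format
On newer releases the equivalent form is bin/hdfs namenode -format.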
Never format an up-and-running Hadoop file system; you will lose all the data stored in the
HDFS.
Step 13
Once the Name Node is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons.
Command: cd hadoop-2.7.3/sbin
Either you can start all daemons with a single command or do it individually.
Command: ./start-all.sh
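To start the daemons individually instead, the per-daemon scripts in the same sbin directory can be used; a hedged sketch:
Command: ./hadoop-daemon.sh start namenode
Command: ./hadoop-daemon.sh start datanode
Command: ./yarn-daemon.sh start resourcemanager
Command: ./yarn-daemon.sh start nodemanager
Command: ./mr-jobhistory-daemon.sh start historyserver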
The Name Node is the centrepiece of an HDFS file system. It keeps the directory tree of
all files stored in the HDFS and tracks all the files stored across the cluster.
Fig 3.15 Hadoop installation – starting Name Node
On start-up, a Data Node connects to the Name Node and responds to requests from the
Name Node for different operations.
The Resource Manager is the master that arbitrates all the available cluster resources and
thus helps in managing the distributed applications running on the YARN system. Its work
is to manage each Node Manager and each application's Application Master.
Start Node Manager:
The Node Manager in each machine framework is the agent, which is responsible for
managing containers, monitoring their resource usage, and reporting the same to the
Resource Manager.
Start JobHistoryServer:
The JobHistoryServer is responsible for servicing all job-history-related requests from clients.
Step 14
To check that all the Hadoop services are up and running, run the below command.
Command: jps
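The snapshot of the output is not reproduced here; if all daemons started correctly, a jps listing would typically include entries of the following form (process IDs will differ):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
JobHistoryServer
Jps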
Step 15
Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the Name
Node interface.
Fig 3.20 Check the Name Node interface.
There are definite advantages to installing and working with the open-source version of
Hadoop, even if you don't actually use its Map Reduce features. For instance, if you really
want to understand the concept of map and reduce, learning how Hadoop does it will give
you a fairly deep understanding of it. You'll also find that, if you are running a Spark job,
putting data in the Hadoop Distributed File System and giving your Spark workers access to
HDFS can come in handy.
3.2.2 Shell
This section covers the most important operations of the Hadoop Distributed File System
using the shell commands that are used for file management in the cluster. HDFS allows user
data to be organized in the form of files and directories. It provides a command line interface
called the FS shell that lets a user interact with the data in HDFS. The syntax of this command
set is similar to other shells (e.g., bash, csh) that users are already familiar with.
Advantages of HDFS Shell over Calling hdfs dfs Directly
hdfs dfs initiates a JVM for each command call, whereas HDFS Shell does it only once,
which means a great speed enhancement when you need to work with HDFS more often.
Commands can be used in a short way, e.g., instead of hdfs dfs -ls /, just ls / – both will work.
Commands cannot be piped; e.g., calling ls /analytics | less is not possible at this time, so you
have to use HDFS Shell in daemon mode.
HDFS Shell is a standard Java application. For its launch you need to define two things on
your class path:
All ./lib/*.jar files (the dependencies in ./lib are included in the binary bundle, or they are
located in the Gradle build/distributions/*.zip)
The path to the directory with your Hadoop cluster config files (hdfs-site.xml, core-site.xml
etc.) - without these files HDFS Shell will work in local file system mode
Note that paths inside the java -cp switch are separated by : on Linux and ; on Windows.
Pre-defined launch scripts are located in the zip file. You can modify it locally as needed.
HDFS Shell can be launched directly with the command to execute - after completion,
hdfs-shell will exit
Launch HDFS Shell with hdfs-shell.sh script <file_path> to execute commands from a file.
Launch HDFS Shell with hdfs-shell.sh xscript <file_path> to execute commands from a file
but ignore command errors (skip errors).
To call a system command, type ! <command>; e.g., !echo hello will call the system
command echo.
Type a (hdfs) command without any parameters to get its parameter description, e.g.,
ls only.
xscript <file_path> executes commands from a file but ignores command errors (skip
errors).
Additional Commands
set showResultCodeON and set showResultCodeOFF - if enabled, the shell will write the
command result code after its completion
cd, pwd
groups <username1> <username2> ... - e.g., groups hdfs prints the groups for the given
users, the same functionality as hdfs groups my_user my_user2
Edit Command
Since the version 1.0.4 the simple command 'edit' is available. The command gets selected
file from HDFS to the local temporary directory and launches the editor. Once the editor
saves the file (with a result code 0), the file is uploaded back into HDFS (target file is
overwritten). By default, the editor path is taken from $EDITOR environment variable. If
$EDITOR is not set, vim (Linux, Mac) or notepad.exe (Windows) is used.
HDFS Shell supports a customized bash-like prompt. It supports most of the bash prompt
switches (including colours; excluding \! and \#). You can also use an online prompt
generator to create the prompt value of your wish. To set up your favourite prompt, simply
add export HDFS_SHELL_PROMPT="value" to your .bashrc (or set the environment
variable on Windows), and that's it. Restart HDFS Shell to apply the change. The default
value is currently set to
\e[36m\u@\h \e[0;39m\e[33m\w\e[0;39m\e[36m\\$ \e[37;0;39m.
FS Shell Commands
The Hadoop fs command runs a generic file system user client that interacts with the MapR
file system (MapR-FS).
Count of Directories, Files and Bytes in Specified Path and File Pattern
Move File from One Location to Another
Delete File
Put File from the Local file System to Hadoop Distributed File System:
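The command listings for the operations above are not reproduced in this copy; hedged examples with hypothetical paths:
$ hadoop fs -count /user/data                          # count of directories, files and bytes under a path
$ hadoop fs -mv /user/data/a.txt /user/archive/a.txt   # move a file from one location to another
$ hadoop fs -rm /user/archive/a.txt                    # delete a file
$ hadoop fs -put localfile.txt /user/data/             # put a file from the local file system into HDFS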
Java is an object-oriented programming language that runs on almost all electronic devices.
Java is platform-independent because of Java virtual machines (JVMs). It follows the
principle of "write once, run everywhere.” When a JVM is installed on the host operating
system, it automatically adapts to the environment and executes the program’s functionalities.
API stands for application program interface. A programmer writing an application program
can make a request to the Operating System using API (using graphical user interface or
command interface).
It is a set of routines, protocols and tools for building software and applications. It may be
any type of system like a web-based system, operating-system, or a database System.
APIs are important software components bundled with the JDK. APIs in Java include classes,
interfaces, and user Interfaces. They enable developers to integrate various applications and
websites and offer real-time information.
Public
Public (or open) APIs are Java APIs that come with the JDK. They do not have strict
restrictions about how developers use them.
Private
Private (or internal) APIs are developed by a specific organization and are accessible to only
employees who work for that organization.
Partner
Partner APIs are considered to be third-party APIs and are developed by organizations for
strategic business operations.
Composite
Composite APIs are microservices, and developers build them by combining several service
APIs.
Usage
APIs can be used, for example, as implementations of protocols.
3.3 HIVE
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. A data warehouse provides a central store of information that can easily be
analysed to make informed, data driven decisions. Hive allows users to read, write, and
manage petabytes of data using SQL.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up
and developed it further as open source under the name Apache Hive. It is used by different
companies; for example, Amazon uses it in Amazon Elastic Map Reduce.
Benefits of Hive
FAST
FAMILIAR
SCALABLE
Hive is not
A relational database
A language for real-time queries and row-level updates
Features of Hive
3.3.1 Architecture
The following architecture explains the flow of submission of query into Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++.
It supports different types of clients such as
Thrift Server - It is a cross-language service provider platform that serves requests from
all programming languages that support Thrift.
JDBC Driver - It is used to establish a connection between hive and Java applications.
The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Driver - It allows the applications that support the ODBC protocol to connect to
Hive.
Hive Services
Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
Hive MetaStore - It is a central repository that stores all the structural information of the
various tables and partitions in the warehouse. It also includes metadata about columns
and their types, the serializers and deserializers which are used to read and write data,
and the corresponding HDFS files where the data is stored.
Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements
into Map Reduce jobs.
Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of
map-reduce tasks and HDFS tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.
3.3.2 Installation
Before you start the process of installing and configuring Hive, it is necessary to have the
following tools available in your local environment.
If not, you will need to have the below software for Hive to be working appropriately.
Java
Hadoop
Yarn
Apache Derby
Java must be installed on your system before installing Hive. Let us verify java installation
using the following command:
$ java -version
If Java is already installed on your system, you get to see the following response:
Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop
installation using the following command:
$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:
Downloading Hive
We use hive-0.14.0 in this tutorial. You can download it by visiting the following link
http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the
/Downloads directory. Here, we download Hive archive named “apache-hive-0.14.0-
bin.tar.gz” for this tutorial. The following command is used to verify the download:
$ cd Downloads
$ ls
apache-hive-0.14.0-bin.tar.gz
Installing Hive
The following steps are required for installing Hive on your system. Let us assume the Hive
archive is downloaded onto the /Downloads directory.
The following command is used to verify the download and extract the hive archive:
$ ls
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
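The extraction step itself is not shown in this copy; it would typically be:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz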
We need to copy the files as the super user ("su -"). The following commands are used
to copy the files from the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
$ source ~/.bashrc
Configuring Hive
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in the
$HIVE_HOME/conf directory. The following commands redirect to Hive config folder and
copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server
to configure Metastore. We use Apache Derby database.
The following commands are used for extracting and verifying the Derby archive:
$ ls
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
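As above, the extraction step is not shown; it would typically be:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz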
Configuring Metastore means specifying to Hive where the database is stored. You can do
this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of
all, copy the template file using the following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and
</configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass =
org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL =
jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Verifying Hive Installation
Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS.
Here, we use the /user/hive/warehouse folder. You need to set write permission (chmod g+w)
for these newly created folders and set them up in HDFS before verifying Hive.
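The exact command listing is not reproduced in this copy; a hedged sketch using the hadoop fs utility would be:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse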
$ cd $HIVE_HOME
$ bin/hive
………………….
hive>
OK
hive>
3.3.3 Comparison with Traditional Database
Hive: Schema on READ - the schema is not verified while the data is loaded. Traditional
database: Schema on WRITE - the table schema is enforced at data load time, i.e., if the data
being loaded does not conform to the schema, it is rejected.
Hive: very easily scalable at low cost. Traditional database: not as scalable; scaling up is costly.
Hive: based on the Hadoop notion of write once, read many times. Traditional database: data
can be read and written many times.
Hive: record-level updates are not possible. Traditional database: record-level updates,
insertions and deletes, transactions and indexes are possible.
3.4 HIVEQL
Hive is a data warehouse infrastructure and supports analysis of large datasets stored in
Hadoop's HDFS and compatible file systems. It provides an SQL (Structured Query
Language) like language called Hive Query Language (HiveQL).
The Hive Query Language (HiveQL) is a query language for Hive to process and analyse
structured data in a Metastore. It filters the data using the condition and gives you a finite
result. The built-in operators and functions generate an expression, which fulfils the
condition.
In addition, Hive's SQL-inspired language separates the user from the complexity of Map
Reduce programming. It also reuses familiar concepts from the relational database world,
such as tables, rows, columns, and schema, to ease learning.
Moreover, most of the interactions tend to take place over a command line interface (CLI).
Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record
Columnar File).
Basically, for single-user metadata storage Hive uses the Derby database, whereas for
multi-user or shared metadata Hive uses MySQL.
Features of HiveQL
Being a high-level language, Hive queries are implicitly converted to map-reduce jobs or
complex DAGs (directed acyclic graphs). Using the ‘Explain’ keyword before the query,
we can get the query plan.
Faster query execution using metadata storage in an RDBMS format; data is replicated,
making retrieval easy in case of loss.
Hive can process different types of compressed files, thus saving disk space.
Different file formats are supported like Text file, Sequence file, ORC (Optimised Row
Columnar), RCFile, Avro and Parquet. ORC file format is most suitable for improving
query performance as it stores data in the most optimized way, leading to faster query
execution.
It is an efficient data analytics and ETL tool for large datasets. Queries are easy to write,
as HiveQL is similar to SQL. DDL (Data Definition Language) commands in Hive are
used to specify and change the structure of databases or tables; these commands are drop,
create, truncate, alter, show, and describe (see the sketch after this list).
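A minimal HiveQL sketch of these DDL commands together with EXPLAIN; the employees table and its columns are assumptions made for illustration:
CREATE TABLE employees (id INT, name STRING, salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
ALTER TABLE employees ADD COLUMNS (dept STRING);
SHOW TABLES;
DESCRIBE employees;
EXPLAIN SELECT name FROM employees WHERE salary > 50000;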
Limitations
Real-time data processing or querying is not offered through Hive.
With petabytes of data, ranging from billions to trillions of records, HiveQL has a large scope
for big data professionals.
Scope of HiveQL
Below is how the scope of HiveQL widens and better serves to analyse humungous data
generated by users every day.
Security: Along with processing large data, Hive provides data security. This task is
complex for the distributed system, as multiple components are needed to communicate
with each other. Kerberos authorization support allows authentication between client and
server.
Locking: Traditionally, Hive lacks locking on rows, columns, or queries. Hive can
leverage Apache Zookeeper for locking support.
Conclusion
HiveQL is widely used across organizations to solve complex use cases. Keeping in mind the
features and limitations offered by the language, the Hive query language is used in
telecommunications, healthcare, retail, banking and financial services, and even in NASA's
Jet Propulsion Laboratory's climate evaluation system. Ease of writing SQL-like queries and
commands accounts for its wide acceptance. The field's growing job opportunities lure
freshers and professionals from different sectors to gain hands-on experience and knowledge
about the field.
3.4.1 Querying Data
A database query is either an action query or a select query. A select query is one that
retrieves data from a database. An action query asks for additional operations on data, such as
insertion, updating, deleting or other forms of data manipulation.
This doesn't mean that users just type in random requests. For a database to understand
demands, it must receive a query based on the predefined code. That code is a query
language.
Queries are one of the things that make databases so powerful. A "query" refers to the action
of retrieving data from your database. Usually, you will be selective with how much data you
want returned. If you have a lot of data in your database, you probably don't want to see
everything. More likely, you'll only want to see data that fits a certain criterion.
For example, you might only want to see how many individuals in your database live in a
given city. Or you might only want to see which individuals have registered with your
database within a given time period.
As with many other tasks, you can query a database either programmatically or via a user
interface.
Option 1: Programmatically
The way to retrieve data from your database with SQL is to use the SELECT statement; a
second query can then be restricted with a condition such as WHERE ArtistId = 1.
The second query only returns records where the value in the ArtistId column equals 1. So, if
there are, say, three albums belonging to artist 1, then three records would be returned.
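The statements referred to above are not reproduced in this copy; a hedged sketch of the two queries, assuming a hypothetical Albums table, would be:
SELECT * FROM Albums;
SELECT * FROM Albums
WHERE ArtistId = 1;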
SQL is a powerful language, and the above statement is very simple. You can use SQL to
choose which columns you want to display, you could add further criteria, and you can even
query multiple tables at the same time. If you're interested in learning more about SQL, be
sure to check out the SQL tutorial after you've finished this one!
Option 2: Via a User Interface
You might find the user interface easier for generating your queries, especially if they are
complex.
Database management systems usually offer a "design view" for your queries. Design view
enables you to pick and choose which columns you want to display and what criteria you'd
like to use to filter the data.
3.4.2 Sorting
Sorting refers to arranging data in a particular format. Sorting algorithm specifies the way to
arrange data in a particular order. Most common orders are in numerical or lexicographical
order.
The importance of sorting lies in the fact that data searching can be optimized to a very high
level if data is stored in a sorted manner. Sorting is also used to represent data in more
readable formats. Following are some of the examples of sorting in real-life scenarios –
Telephone Directory –
The telephone directory stores the telephone numbers of people sorted by their names, so
that the names can be searched easily.
Dictionary –
The dictionary stores words in an alphabetical order so that searching of any word
becomes easy.
Sorting algorithms may require some extra space for comparison and temporary storage of a
few data elements. Algorithms that do not require any extra space are said to sort in-place,
for example within the array itself; this is called in-place sorting. Bubble sort is an example
of in-place sorting.
However, in some sorting algorithms, the program requires space which is more than or equal
to the elements being sorted. Sorting which uses equal or more space is called not-in-place
sorting. Merge-sort is an example of not-in-place sorting.
If a sorting algorithm, after sorting the contents, does not change the sequence of similar
content in which they appear, it is called stable sorting.
If a sorting algorithm, after sorting the contents, changes the sequence of similar content in
which they appear, it is called unstable sorting.
An adaptive algorithm takes advantage of elements that are already sorted and tries not to
re-order them. A non-adaptive algorithm is one which does not take into account the elements
which are already sorted; it tries to force every single element to be re-ordered to confirm its
sortedness.
Important Terms
Some terms are generally coined while discussing sorting techniques, here is a brief
introduction to them −
Increasing Order
Decreasing Order
Non-Increasing Order
Non-Decreasing Order
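Returning to HiveQL, sorting in a query is expressed with ORDER BY (a total order enforced through a single reducer) or SORT BY (ordering within each reducer). A minimal sketch against the hypothetical employees table used earlier:
SELECT name, salary FROM employees ORDER BY salary DESC;
SELECT name, salary FROM employees SORT BY salary DESC;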
3.4.3 Aggregating
The Hive provides various in-built functions to perform mathematical and aggregate type
operations. In Hive, the aggregate function returns a single value resulting from computation
over many rows. Let’s see some commonly used aggregate functions: -
count(*) - returns BIGINT; it returns the count of the number of rows present in the file.
min(col) - returns DOUBLE; it compares the values and returns the minimum of them.
max(col) - returns DOUBLE; it compares the values and returns the maximum of them.
Examples of Aggregate Functions
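The worked examples are not reproduced above; a hedged sketch against the same hypothetical employees table:
SELECT count(*) FROM employees;
SELECT dept, min(salary), max(salary)
FROM employees
GROUP BY dept;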
With the increasing popularity of big data applications, Map Reduce has become the standard
for performing batch processing on commodity hardware. However, Map Reduce code can be
quite challenging to write for developers, let alone data scientists and administrators.
Hive is a data warehousing framework that runs on top of Hadoop and provides an SQL
abstraction for Map Reduce apps. Data analysts and business intelligence officers need not
learn another complex programming language for writing Map Reduce apps. Hive will
automatically interpret any SQL query into a series of Map Reduce jobs.
Users can also plug in their own custom mappers and reducers in the data stream by using
features natively supported in the Hive language; e.g., in order to run a custom mapper script
- map_script - and a custom reducer script - reduce_script - the user can issue a command
which uses the TRANSFORM clause to embed the mapper and the reducer scripts (a sketch
follows this discussion).
By default, columns will be transformed to STRING and delimited by TAB before feeding to
the user script; similarly, all NULL values will be converted to the literal string \N in order to
differentiate NULL values from empty strings.
The standard output of the user script will be treated as TAB-separated STRING columns,
any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING
column will be cast to the data type specified in the table declaration in the usual way. User
scripts can output debug information to standard error which will be shown on the task detail
page on hadoop. These defaults can be overridden with ROW FORMAT.
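The command referred to above is not reproduced in this copy. A hedged sketch of the TRANSFORM pattern follows; the table, columns and output names are assumptions, and map_script and reduce_script stand for the user's own executables shipped with ADD FILE:
ADD FILE map_script;
ADD FILE reduce_script;
FROM (
  SELECT TRANSFORM (userid, page_url)
  USING 'map_script'
  AS key, value
  FROM page_views
  CLUSTER BY key
) mapped
INSERT OVERWRITE TABLE page_view_counts
SELECT TRANSFORM (mapped.key, mapped.value)
USING 'reduce_script'
AS page_url, view_count;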
3.4.5 Joins
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the database.
Syntax
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
Join
The JOIN clause is used to combine and retrieve records from multiple tables. A plain JOIN
behaves like an inner join in SQL. A JOIN condition is typically expressed using the primary
keys and foreign keys of the tables (see the sketch below).
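A minimal hedged sketch of a HiveQL inner join; the customers and orders tables and their key columns are assumptions:
SELECT c.id, c.name, o.amount
FROM customers c JOIN orders o
ON (c.id = o.customer_id);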
3.4.6 Sub queries
A Query present within a Query is known as a sub query. The main query will depend on the
values returned by the sub queries.
When to use
To get a particular value combined from two column values from different tables
Syntax
Types
Table sub query - you can write the sub query in place of a table name.
Sub query in WHERE clause - these types of sub queries are widely used in HiveQL
queries and statements. The sub query in the WHERE clause can return either a single
value or multiple values; if it returns a single value, use the equality operator, otherwise
use the IN operator (see the sketch after this list).
Correlated sub query - correlated sub queries are queries in which the sub query refers to
a column from the parent table clause.
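A minimal hedged sketch of a sub query in the WHERE clause; the employees and departments tables are assumptions:
SELECT name
FROM employees
WHERE dept IN (SELECT dept FROM departments WHERE location = 'London');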
3.5 CONCEPTS OF HBASE
Unlike relational database systems, HBase does not support a structured query language like
SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in
Java™ much like a typical Apache Map Reduce application. HBase does support writing
applications in Apache Avro, REST and Thrift.
An HBase system is designed to scale linearly. It comprises a set of standard tables with rows
and columns, much like a traditional database. Each table must have an element defined as a
primary key, and all access attempts to HBase tables must use this primary key.
HBase has two fundamental key structures: the row key and the column key. Both can be
used to convey meaning, by either the data they store, or by exploiting their sorting order.
HBase's main unit of separation within a table is the column family - not the actual columns
as expected from a column-oriented database in the traditional sense. Although you store
cells in a table format logically, in reality these rows are stored as linear sets of the actual
cells, which in turn contain all the vital information inside them.
The top-left part of the figure shows the logical layout of your data—you have rows and
columns. The columns are the typical HBase combination of a column family name and a
column qualifier, forming the column key. The rows also have a row key so that you can
address all columns in one logical row.
Avro, as a component, supports a rich set of primitive data types including numeric, binary
data and strings; and a number of complex types including arrays, maps, enumerations and
records. A sort order can also be defined for the data. HBase relies on Zoo Keeper for high-
performance coordination. Zoo Keeper is built into HBase, but if you’re running a production
cluster, it’s suggested that you have a dedicated Zoo Keeper cluster that’s integrated with
your HBase cluster. HBase works well with Hive, a query engine for batch processing of big
data, to enable fault-tolerant big data applications.
Example of HBase
An HBase column represents an attribute of an object; if the table is storing diagnostic logs
from servers in your environment, each row might be a log record, and a typical column
could be the timestamp of when the log record was written, or the server’s name where the
record originated.
HBase allows for many attributes to be grouped together into column families, such that the
elements of a column family are all stored together. This is different from a row-oriented
relational database, where all the columns of a given row are stored together. With HBase
you must predefine the table schema and specify the column families. However, new
columns can be added to families at any time, making the schema flexible and able to adapt
to changing application requirements.
Just as HDFS has a Name Node and slave nodes, and Map Reduce has Job Tracker and Task
Tracker slaves, HBase is built on similar concepts. In HBase a master node manages the
cluster, and region servers store portions of the tables and perform the work on the data. In
the same way that HDFS has some enterprise concerns due to the availability of the Name
Node, HBase is also sensitive to the loss of its master node.
Features
Convenient base classes for backing Hadoop Map Reduce jobs with Apache HBase
tables.
Thrift gateway and a REST-full Web service that supports XML, Protobuf, and binary
data encoding options
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or
via JMX
HBase’s data model is very different from what you have likely worked with or know of in
relational databases. As described in the original Big table paper, it’s a sparse, distributed,
persistent multidimensional sorted map, which is indexed by a row key, column key, and a
timestamp. You’ll hear people refer to it as a key-value store, a column-family-oriented
database, and sometimes a database storing versioned maps of maps. All these descriptions
are correct. This section touches upon these various concepts. The easiest and most naive way
to describe HBase’s data model is in the form of tables, consisting of rows and columns. This
is likely what you are familiar with in relational databases.
But that's where the similarity between RDBMS data models and HBase ends. HBase
schema design is very different compared to relational database schema design. Below are
some general concepts that should be followed while designing a schema in HBase:
Row key: Each table in HBase table is indexed on row key. Data is sorted
lexicographically by this row key. There are no secondary indices available on HBase
table.
Atomicity: Avoid designing tables that require atomicity across all rows. All
operations on HBase rows are atomic at row level.
Even distribution: Reads and writes should be uniformly distributed across all nodes
available in the cluster. Design the row key in such a way that related entities are stored in
adjacent rows, to increase read efficiency.
Consider a table in HBase consisting of two column families, Personal and Office, each
having two columns. The entity that contains the data is called a cell. The rows are sorted
based on the row keys. A table in HBase would look like the figure given below.
Fig 3.26 Table in HBase
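A minimal HBase shell sketch of such a table; the table name employee and the row/column values are assumptions:
create 'employee', 'personal', 'office'
put 'employee', 'row1', 'personal:name', 'John'
put 'employee', 'row1', 'office:dept', 'Sales'
get 'employee', 'row1'
scan 'employee'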
Advanced Indexing likewise known as Secondary indexes are an orthogonal way to access
data from its primary access path. In HBase, you have a single index that is lexicographically
sorted on the primary row key. Access to records in any way other than through the primary
row requires scanning over potentially all the rows in the table to test them against your filter.
With secondary indexing, the columns, or expressions you index form an alternate row key to
allow point lookups and range scans along this new axis.
In HBase there are no secondary indexes. The row key, column family, and column qualifier
are all stored in sort order based on the Java comparable method for byte arrays (everything
is a byte array in HBase). There is a bit of elegance in this simplicity. The downside is that
the only access pattern is based on the row key, so if you have multiple use cases where the
access patterns are orthogonal to each other, the second use case means a full table scan.
If you create a second table, and invert the data from (row key, value) to a (value, row key),
you have what is known as an inverted table. That is to say, if I want to create an index on an
attribute in a record, I could just create a second table where the attribute is the row key and
then create a column entry for each row key of the base table.
As more people want to use HBase like a database and apply SQL, using secondary indexing
makes filtering and doing data joins much more efficient. One just takes the intersection of
the indexed qualifiers specified, and then applies the unindexed qualifiers as filters further
reducing the result set.
3.6 PIG
Apache Pig is a platform for analysing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences
of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g.,
the Hadoop subproject). Pig's language layer currently consists of a textual language called
Pig Latin, which has the following key properties:
Optimization opportunities - The way in which tasks are encoded permits the system
to optimize their execution automatically, allowing the user to focus on semantics rather
than efficiency.
Apache Pig is an abstraction over Map Reduce. It is a tool/platform which is used to analyse
large data sets, representing them as data flows. Pig is generally used with Hadoop; we can
perform all the data manipulation operations in Hadoop using Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators using which programmers can develop their own
functions for reading, writing, and processing data.
To analyse data using Apache Pig, programmers need to write scripts using Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has
a component known as Pig Engine that accepts the Pig Latin scripts as input and converts
those scripts into Map Reduce jobs.
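A minimal Pig Latin sketch of such a script; the input file, schema, and field names are assumptions:
-- load a tab-separated file of (name, age) records
users = LOAD 'users.txt' USING PigStorage('\t') AS (name:chararray, age:int);
-- keep only adults, group them by age and count each group
adults = FILTER users BY age >= 18;
grouped = GROUP adults BY age;
counts = FOREACH grouped GENERATE group AS age, COUNT(adults) AS total;
DUMP counts;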
Features of Pig
Rich set of operators − It provides many operators to perform operations like join, sort,
filter, etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if
you are good at SQL.
Extensibility − Using the existing operators, users can develop their own functions to
read, process, and write data.
Handles all kinds of data − Apache Pig analyses all kinds of data, both structured as
well as unstructured. It stores the results in HDFS.
3.7 ZOOKEEPER
Distributed applications are difficult to coordinate and work with, as they are much more
error prone due to the huge number of machines attached to the network. As many machines
are involved, race conditions and deadlocks are common problems when implementing
distributed applications. A race condition occurs when a machine tries to perform two or
more operations at a time, and this can be taken care of by the serialization property of Zoo
Keeper. Deadlocks occur when two or more machines try to access the same shared resource
at the same time; more precisely, they try to access each other's resources, which leads to a
lock of the system, as neither system releases the resource but waits for the other to release
it. Synchronization in Zookeeper helps to solve deadlocks. Another major issue with
distributed applications can be partial failure of a process, which can lead to inconsistency of
data. Zookeeper handles this through atomicity, which means either the whole of the process
will finish or nothing will persist after failure. Thus, Zookeeper is an important part of
Hadoop that takes care of these small but important issues so that the developer can focus
more on the functionality of the application.
Fig 3.27 Zookeeper in the hadoop
Hadoop Zoo Keeper is a distributed application that follows a simple client-server model
where clients are nodes that make use of the service, and servers are nodes that provide the
service. Multiple server nodes are collectively called Zoo Keeper ensemble. At any given
time, one Zoo Keeper client is connected to at least one Zoo Keeper server. A master node is
dynamically chosen by consensus within the ensemble; thus, usually, a Zookeeper ensemble
has an odd number of nodes so that there is a majority vote. If the master node fails, another
master is chosen in no time, and it takes over from the previous master. Other than masters
and slaves there are also observers in Zookeeper. Observers were brought in to address the
issue of scaling: with the addition of slaves, the write performance is affected, as the voting
process is expensive. So, observers are slaves that do not take part in the voting process but
otherwise have similar duties to other slaves.
Writes in Zookeeper
All the writes in Zookeeper go through the Master node, thus it is guaranteed that all writes
will be sequential. On performing write operation to the Zookeeper, each server attached to
that client persists the data along with master. Thus, this makes all the servers updated about
the data. However, this also means that concurrent writes cannot be made. Linear writes
guarantee can be problematic if Zookeeper is used for writing dominant workload. Zookeeper
in Hadoop is ideally used for coordinating message exchanges between clients, which
involves less writes and more reads. Zookeeper is helpful till the time the data is shared but if
application has concurrent data writing, then Zookeeper can come in way of the application
and impose strict ordering of operations.
Reads in Zookeeper
Zookeeper is best at reads, as reads can be concurrent. Concurrent reads are possible because
each client is attached to a different server and all clients can read from their servers
simultaneously, although concurrent reads lead to eventual consistency, as the master is not
involved. There can be cases where a client has an outdated view, which gets updated with a
little delay.
Apache Zookeeper provides a hierarchical file system (with ZNodes as the system files) that
helps with the discovery, registration, configuration, locking, leader selection, queuing, etc.
of services working in different machines. Zoo Keeper server maintains configuration
information, naming, providing distributed synchronization, and providing group services,
used by distributed applications.
3.7.1 How it helps in monitoring a cluster?
Applications Manager aims to help administrators manage their Zookeeper servers - collect
all the metrics that can help with troubleshooting, display performance graphs and be alerted
automatically of potential issues. Let's take a look at what you need to monitor in Zookeeper
and the performance metrics to gather with Applications Manager:
Thread and JVM usage - Track thread usage with metrics like Daemon, Peak and Live
Thread Count. Ensure that started threads don’t overload the server's memory.
Performance Statistics - Gauge the amount of time it takes for the server to respond to a
client request, queued requests and connections in the server and performance
degradation due to network usage (client packets sent and received).
Cluster and Configuration details - Track the number of Znodes, the watcher setup over
the nodes and the number of followers within the ensemble. Keep an eye on the leader
selection stats and client session times.
Fix Performance Problems Faster - Get instant notifications when there are performance
issues with the components of Apache Zookeeper. Become aware of performance
bottlenecks and take quick remedial actions before your end users experience issues.
Enter the IP Address or hostname of the host where zookeeper server runs.
Enter the JMX Port of the Zookeeper server. By default, it will be 7199. Or Check in
zkServer.sh file for the JMX_PORT.
To discover only this node and not all nodes in the cluster disable the option Discover all
nodes in the Cluster. By default, it is enabled which means all the nodes in the cluster are
discovered by default.
Enter the credential details like username, password and JNDIPath or select credentials
from a Credential Manager list.
Check Is Authentication Required field to give the jmx credentials to be used to connect
to the Zookeeper server.
Click Test Credentials button if you want to test the access to Apache Zookeeper Server.
Choose the Monitor Group from the combo box with which you want to associate
Apache Zookeeper Monitor (optional). You can choose multiple groups to associate your
monitor.
Click Add Monitor(s). This discovers Apache Zookeeper from the network and starts
monitoring.
3.7.2 Uses of HBase in Zookeeper
HBase is a NoSQL data store that runs on top of your existing Hadoop cluster (HDFS). It
provides you capabilities like random, real-time reads/writes, which HDFS being a FS lacks.
Since it is a NoSQL data store it doesn't follow SQL conventions and terminologies. HBase
provides a good set of APIs (includes JAVA and Thrift). Along with this HBase also provides
seamless integration with Map Reduce framework. But, along with all these advantages of
HBase, you should keep in mind that random read-write is quick but always has
additional overhead. So, think well before you make any decision.
HBase relies completely on Zookeeper. HBase provides you the option to use its built-in
Zookeeper, which will get started whenever you start HBase. But this is not good if you are
working on a production cluster. In such scenarios it is always good to have a dedicated
Zookeeper cluster and integrate it with your HBase cluster. In Apache HBase, Zoo Keeper
coordinates, communicates, and shares state between the Masters and Region Servers. HBase
has a design policy of using Zoo Keeper only for transient data (that is, for coordination and
state communication). Thus, if the HBase’s Zoo Keeper data is removed, only the transient
operations are affected — data can continue to be written and read to/from HBase.
In short, the zookeeper is used to maintain the configuration information and communication
between region server and clients. It also provides distribution synchronization. It exposes
common services like naming, configuration management, and group services, in a simple
interface so you don't have to write them from scratch.
Zoo Keeper runs on a cluster of servers called an ensemble that shares the state of your data.
(These may be the same machines that are running other Hadoop services or a separate
cluster.) Whenever a change is made, it is not considered successful until it has been written
to a quorum (at least half) of the servers in the ensemble.
A leader is elected within the ensemble, and if two conflicting changes are made at the same
time, the one that is processed by the leader first will succeed and the other will fail. Zoo
Keeper guarantees that writes from the same client will be processed in the order they were
sent by that client. This guarantee, along with other features discussed below, allow the
system to be used to implement locks, queues, and other important primitives for distributed
queuing. The outcome of a write operation allows a node to be certain that an identical write
has not succeeded for any other node.
It is best to run your Zoo Keeper ensemble with an odd number of servers; typical ensemble
sizes are three, five, or seven. For instance, if you run five servers and two are down, the
cluster will still be available (so you can have one server down for maintenance and still
survive an unexpected failure). If you run six servers, the cluster is still unavailable after
three failures, but the chance of three simultaneous failures is now slightly higher. Also
remember that as you add more servers, you may be able to tolerate more failures, but you
also may begin to have lower write throughput.
You’ll then need to install the correct CDH4 package repository for your system and install
the zookeeper package (required for any machine connecting to Zoo Keeper) and the
zookeeper-server package (required for any machine in the Zoo Keeper ensemble).
The warnings you will see indicate that, the first time Zoo Keeper is run on a given host, it
needs to initialize some storage space. You can do that as shown below and start a Zoo
Keeper server running in a single-node/standalone configuration.
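The exact commands are not reproduced in this copy; on a CDH4-style installation they would typically be:
$ sudo service zookeeper-server init
$ sudo service zookeeper-server start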
Creating a znode is as easy as specifying the path and the contents. Create an empty znode to
serve as a parent ‘directory’, and another znode as its child:
[zk: localhost:2181(CONNECTED) 2] create /zk-demo ''
Created /zk-demo
[zk: localhost:2181(CONNECTED) 3] create /zk-demo/my-node ''
Created /zk-demo/my-node
You can then read the contents of these znodes with the get command. The data contained in
the znode is printed on the first line, and metadata is listed afterwards.
Zoo Keeper can also notify you of changes in a znode’s content or changes in a znode’s
children. To register a “watch” on a znode’s data, you need to use the get or stat commands to
access the current content or metadata and pass an additional parameter requesting the watch.
To register a “watch” on a znode’s children, you pass the same parameter when getting the
children with ls.
[zk: localhost:2181(CONNECTED) 5] create /watch-this 'data'
Created /watch-this
[zk: localhost:2181(CONNECTED) 6] get /watch-this true
data
<metadata>
Modify the same znode (either from the current ZooKeeper client or a separate one), for example with set /watch-this 'new-data', and you will see a message like the following written to the terminal:
WATCHER::
WatchedEvent state:SyncConnected type:NodeDataChanged path:/watch-this
Note that watches fire only once. If you want to be notified of future changes, you must reset the watch each time it fires. Watches allow you to use ZooKeeper to implement asynchronous, event-based systems and to notify nodes when their local copy of the data in ZooKeeper is stale.
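Re-arming a watch is simply a matter of issuing the watched read again after the event fires, as in the sketch below (the data shown is whatever the modifying client wrote):
[zk: localhost:2181(CONNECTED) 7] get /watch-this true
new-data
<metadata>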
3.8 DISTINGUISH BETWEEN HDFS AND HBASE
HDFS is fault-tolerant by design and supports rapid data transfer between nodes even during system failures. HBase is an open-source, non-relational (Not-Only-SQL) database that runs on top of Hadoop. In terms of the CAP (Consistency, Availability, and Partition Tolerance) theorem, HBase is a CP system.
HDFS is most suitable for performing batch analytics. However, one of its biggest drawbacks is its inability to perform real-time analysis, a trending requirement in the IT industry. HBase, on the other hand, can handle large data sets but is not appropriate for batch analytics; instead, it is used to read and write data in Hadoop in real time.
Both HDFS and HBase are capable of handling structured, semi-structured, and unstructured data. HDFS lacks an in-memory processing engine, which slows down data analysis because it relies on plain MapReduce. HBase, by contrast, has an in-memory layer that drastically increases read/write speed.
HDFS is very transparent in its execution of data analysis. HBase, on the other hand, being a NoSQL database with a tabular format, stores values sorted by row key and fetches them through key-based lookups.
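To make the random read/write point concrete, the HBase shell session below is a minimal sketch; the table name weblogs, the column family cf, and the row key are assumptions, and the output is abbreviated.
hbase(main):001:0> create 'weblogs', 'cf'
hbase(main):002:0> put 'weblogs', 'row1', 'cf:status', '200'
hbase(main):003:0> get 'weblogs', 'row1'
COLUMN                     CELL
 cf:status                 timestamp=..., value=200
A single row can be written and read back immediately by its key, with no batch job involved.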
HBase is ideally suited for real-time environments, and this is best demonstrated by the example of our client, a renowned European bank. To derive critical insights from application and web server logs, we implemented a solution using Apache Storm and Apache HBase together. Given the huge velocity of the data, we opted for HBase over HDFS, as HDFS does not support real-time writes. The results were overwhelming: query time was reduced from 3 days to 3 minutes.
Use Case 2 – Analytics Solution for a Global CPG Player using HDFS and MapReduce
For our global beverage client, the primary objective was batch analytics to gain SKU-level insights, involving recursive and sequential calculations. The HDFS and MapReduce frameworks were better suited to this than complex Hive queries on top of HBase. MapReduce was used for data wrangling and to prepare data for subsequent analytics, while Hive was used for custom analytics on top of the data processed by MapReduce. The results were impressive: the time taken to generate custom analytics dropped from 3 days to 3 hours.
HDFS                                                HBase
HDFS is a Java-based file system used for           HBase is a Java-based Not-Only-SQL (NoSQL)
storing large data sets.                            database.
HDFS is ideally suited for write-once,              HBase is ideally suited for random writes and
read-many-times use cases.                          reads of data that is stored in HDFS.
3.9 SUMMARY
The normal file system was designed to work on a single machine or single operating
environment.
The smallest amount of data that can be read from or written to a disk is called the block size.
Getting Hadoop to work across the entire cluster involves installing the required software on all the machines that belong to the cluster. By convention, one of the machines runs the Name Node and another runs the Resource Manager.
HDFS allows user data to be organized in the form of files and directories. It provides
a command line interface called FS shell that lets a user interact with the data in
HDFS.
APIs are important software components bundled with the JDK. APIs in Java include classes, interfaces, and packages.
The Hive Query Language (HiveQL) is a query language for Hive to process and
analyse structured data in a Metastore.
In Hive, an aggregate function returns a single value computed over many rows.
3.10 KEYWORDS
JDBC driver – A JDBC driver is a software component that enables a Java application to interact with a database. For example, the JDBC-ODBC bridge driver converts JDBC method calls into ODBC function calls; it is easy to use and can connect to almost any database, but it is now discouraged in favour of the pure-Java thin driver.
Datasets – A data set (or dataset) is a collection of data. In the case of tabular data, a
data set corresponds to one or more database tables, where every column of a table
represents a particular variable, and each row corresponds to a given record of the
data set in question. The data set lists values for each of the variables, such as height
and weight of an object, for each member of the data set. Each value is known as a
datum. Data sets can also consist of a collection of documents or files.
Name Node – The Name Node works as the master in a Hadoop cluster. It stores the metadata of the actual data, manages the file system namespace, and regulates client access to file data. It assigns work to the slaves (the Data Nodes) and executes file system namespace operations such as opening and closing files and renaming files and directories.
Because the Name Node keeps metadata in memory for fast retrieval, it requires a large amount of memory and should be hosted on reliable hardware.
Shell Commands – The shell is the command interpreter on Linux systems; it is the program that interacts with the user in the terminal emulation window. Shell commands are instructions that tell the system to perform some action. A shell presents a command-line interface that allows you to control your computer using commands entered with a keyboard, instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination.
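For instance, the HDFS FS shell mentioned in the summary is driven by commands of this kind; the paths shown below are placeholders.
$ hdfs dfs -ls /user/hadoop                      # list a directory in HDFS
$ hdfs dfs -put localfile.txt /user/hadoop/      # copy a local file into HDFS
$ hdfs dfs -cat /user/hadoop/localfile.txt       # print the file's contents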
3.12 UNIT END QUESTIONS
A. Descriptive Questions
Short Questions
Long Questions
3. Explain the pros and cons of using the UI compared with directly calling the hdfs dfs commands.
b. Huge datasets
c. Hardware at data
d. All of these
3. What is Hive?
b. relational database
c. OLTP
d. A language
d. Both A and C
c. Zoo Keeper solves this issue with its simple architecture and API.
d. The Zoo Keeper framework was originally built at "Google" for accessing
their applications in an easy and robust manner
Answers
3.13 REFERENCES
Reference Books
Shabbir, M. Q., & Gardezi, S. B. W. (2020). Application of big data analytics and organizational performance: The mediating role of knowledge management practices. Journal of Big Data, 7, 47.
García-Gil, D., Ramírez-Gallego, S., García, S., & Herrera, F. (2017). A comparison
on scalability for batch big data processing on Apache Spark and Apache Flink. Big
Data Analytics, 2(1). https://doi.org/10.1186/s41044-016-0020-2
Textbooks
Touil, M. (2019). Big Data: Spark Hadoop and Their databases. Independently
published.
Websites
https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster
https://github.com/avast/hdfs-shell#advantages-ui-against-direct-calling-hdfs-dfs-function
https://www.javatpoint.com/api-full-form