BDA UNIT-1

Outline

• Introduction to Big Data


• Problems with Traditional Approach
• What is Hadoop
• Need of HDFS and MapReduce in Hadoop
• Why RDBMS Can’t Be Used
• Google File System (GFS)
• Hadoop Distributed File System (HDFS)
• Building Blocks of Hadoop
• Introducing and Configuring Hadoop Cluster
• Configuring XML Files
Big Data
• Big Data is a term used to describe collections of data that are so large and complex,
and that keep growing exponentially with time, that traditional data management
tools are not able to store and process them efficiently
• The characteristics of big data are the Five V’s: Volume, Velocity, Variety, Veracity and
Value.

5 Vs Of Big Data Characteristics


Cont..
• The “big” in big data is not just about volume. While big data certainly involves
having a lot of data, big data does not refer to data volume alone. What it means is
that you are not only getting a lot of data. It is also coming at you fast, it is coming at
you in complex format, and it is coming at you from a variety of sources.
• Collecting huge amounts of data is not valuable in itself; what matters is what
organizations do with the data. Big data can be analyzed for insights that lead to better
decisions, strategic business moves and process automation.
Big Data Analytics

• Big Data Analytics is “the process of examining large data sets containing a variety
of data types – i.e., Big Data – to uncover hidden patterns, unknown correlations,
market trends, customer preferences, and other useful information that can help
organizations make informed business decisions.”
• Big Data Analytics often yields several business benefits, including more effective
marketing campaigns, the discovery of new revenue opportunities, improved
customer service delivery, more efficient operations, and competitive advantages.
• Big Data Analytics gives analytics professionals, such as data scientists and
predictive modelers, the ability to analyze Big Data from multiple and varied sources,
including transactional data and other structured data.
How is Big Data Actually Used

Example 1: Better Understand and Target Customers


 To better understand and target customers, companies expand their traditional
data sets with social media data, browser data, text analytics, sensor data and many
more to get a complete picture of their customers.
 The objective in all these cases is to create predictive models, so that companies in
the telecom industry can predict customer churn, retailers can predict which
products will sell, and car insurance companies can understand how their customers
actually drive.
Example 2: Improving Health
 The computing power of big data analytics enables us to find new cures, better
understand and predict disease patterns. We can use all the data from smart watches
and wearable devices to better understand links between lifestyles and diseases.
 Big data analytics also allows us to monitor and predict epidemics and disease
outbreaks, simply by listening to what people are saying (e.g. "Feeling rubbish today –
in bed with a cold") or searching for on the internet (e.g. "cures for flu").
Example 3: Improving and Optimizing Cities and Countries
 Big data is used to improve many aspects of our cities and countries. For example, it
allows cities to optimize traffic flows based on real time traffic information as well as
social media and weather data.
 A number of cities are currently using big data analytics with the aim of turning
themselves into smart cities, where the transport infrastructure and utility processes
are all joined up, so that a bus waits for a delayed train and traffic signals predict
traffic volumes and operate to minimize jams.
Applications of Big Data
Big Data applications are endless and found in various fields today. The major fields where big data is
being used are as follows:
Types of Analytics
Four Types of Analytics
1. Prescriptive – This type of analysis reveals what actions should be taken. This
is the most valuable kind of analysis and usually results in rules and
recommendations for next steps.
2. Predictive – An analysis of likely scenarios of what might happen. The
deliverable is usually a predictive forecast.
3. Diagnostic – A look at past performance to determine what happened and
why. The result of the analysis is often an analytic dashboard.
4. Descriptive – What is happening now based on incoming data. To mine the
analytics, you typically use a real-time dashboard and/or email reports.
Traditional Data Systems
• Every year organizations need to store more and more detailed information for longer
periods of time.
• Increased regulation in areas such as health and finance is significantly increasing
storage volumes.
• Expensive shared storage systems often store this data because of the critical nature
of the information.
• Managing the volume and cost of this data growth within these traditional systems is
usually a stress point for IT organizations.
• Examples of data often stored in structured form include Enterprise Resource
Planning (ERP), Customer Relationship Management (CRM), financial, retail, and
customer information.
Cont.
• Traditional data systems, such as relational databases and data warehouses, have
been the primary way businesses and organizations have stored and analyzed their
data for the past 30 to 40 years.
• Although other data stores and technologies exist, the major percentage of business
data can be found in these traditional systems.
• Traditional systems are designed from the ground up to work with data that has
primarily been structured data.
• Systems that support Atomicity, Consistency, Isolation and Durability (ACID), and
the strategy around them, are still important for running the business.
• A number of these systems were built over the years and support business decisions
that run an organization today. Relational databases and data warehouses can store
petabytes (PB) of information.
Cont.
• However, these systems were not designed from the ground up to address a number
of today’s data challenges.
• The cost, required speed, and complexity of using these traditional systems to
address these new data challenges would be extremely high
Problems with Traditional Approach (or) Limitations of RDBMS to
support big data
1. Schema on write
o Traditional systems are schema-on-write.
o Schema-on-write means that a schema must be defined for the data before it is
written into the database. If you have done any kind of development with a
relational database (RDBMS), you are familiar with this structured nature, because
you have used Structured Query Language (SQL) to read data from the database.
o This means that a lot of work must be done before new data sources can be
analyzed.
o The exploding growth of unstructured data and the ETL overhead of storing data
in an RDBMS are the main reasons for the shift to schema-on-read.
Solution:
• Hadoop systems are schema-on-read, which means any data can be written to the
storage system immediately.
• Data are not validated until they are read. This enables Hadoop systems to load any
type of data and begin analyzing it quickly.
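A minimal sketch of what schema-on-read looks like in practice is shown below: the raw record sits in storage exactly as it arrived, and the field layout (hypothetical here) is applied only by the program that reads it.

// Schema-on-read: the raw record is stored as-is and interpreted only at read time.
public class SchemaOnReadExample {
    public static void main(String[] args) {
        // A raw line as it might sit in HDFS; no schema was enforced when it was written.
        String rawLine = "2024-01-15,alice,login,203.0.113.7";

        // The "schema" is applied here, by the consuming program, not by the storage system.
        String[] fields = rawLine.split(",");
        String date = fields[0];
        String user = fields[1];
        String action = fields[2];

        System.out.println(user + " performed " + action + " on " + date);
    }
}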
2. Scale-Up Instead of Scale-out
– First, the data size has increased tremendously, to the range of petabytes
(one petabyte = 1,024 terabytes).
– RDBMS finds it challenging to handle such huge data volumes. To address this,
more central processing units (CPUs) or more memory are added to the database
management system to scale up vertically.
– Scaling a commercial relational database is expensive because it relies on
vertical scalability (scale-up).
• Horizontal scaling means that you scale by adding more machines to your pool of
resources, whereas vertical scaling means that you scale by adding more power
(CPU, RAM) to an existing machine.
• Hadoop uses horizontal scalability, hence it is cost-effective.
3. High Cost of storage
 Traditional data systems use a centralized database architecture in which large and
complex problems are solved by a single computer system. A centralized architecture is
costly and ineffective for processing large amounts of data.
 Big data is based on a distributed database architecture, where a large problem is
solved by dividing the data into several smaller blocks and computing the solution on
several different computers in a computer network.
 The distributed database also has more computational power than the centralized
database system used to manage traditional data.
4. Bringing Data to the Programs
In relational databases and data warehouses, data are loaded from shared storage
elsewhere in the datacenter. The data must go over wires and through switches that
have bandwidth limitations before programs can process the data. For many types of
analytics that process 10s, 100s, and 1000s of terabytes, the capability of the
computational side to process data greatly exceeds the storage bandwidth available.
5. Handling Heterogeneity of data
In the traditional approach, the main issue was handling the heterogeneity of
data, i.e. structured, semi-structured and unstructured. The RDBMS (SQL)
focuses mostly on structured data such as database tables that conform to a
particular schema. But most big data is semi-structured or unstructured,
with which SQL cannot work.
6. Lack of High Velocity
“Big data” is generated at a very high velocity. RDBMS struggles with high velocity
because it is designed for steady data retention rather than rapid growth. Even if an
RDBMS is used to handle and store “big data,” it will turn out to be very expensive.
7. Data Storage and Analysis:
◦ Although the storage capacities of hard disk drives have increased, the disk
access (transfer) speed, i.e. the speed at which data can be read from a drive, has
not kept pace, so processing takes an enormous amount of time.
◦ For example, one-terabyte drives are common today, but with a transfer speed of
around 100 MB/s it takes roughly 10,000 seconds, i.e. more than two and a half hours,
to read all the data from a single drive. One way to reduce this time is to read from
multiple disks at once: working in parallel across 100 drives, each holding one
hundredth of the data, we could read the same data in under two minutes.
8. Declarative Query Instead of Functional Programming:
– Data in RDBMS should be relational. Big Data need not be relational.
– In the support of relational data SQL is used.
– SQL is fundamentally a high-level declarative language: we query data by
stating the result we want. Under MapReduce (in Hadoop), we can specify the
actual steps used to process the data.
– RDBMS is inadequate for working with languages other than SQL.
– Relational data is often normalized to remove redundancy.

The solution to the above problems is the Hadoop ecosystem, which provides a
reliable data storage and analysis system.
Hadoop

1. Hadoop is an open-source distributed processing framework that manages data
processing and storage for big data applications.
2. It is designed to run on clusters of commodity hardware (it doesn’t require
expensive, highly reliable networks).
3. It is a project of the Apache Software Foundation. It was created by computer
scientists Doug Cutting and Mike Cafarella in 2005.

4. The emphasis is on high throughput of data access rather than low latency of data
access
Cont..

5. Hadoop provides interfaces for applications to move themselves closer to where
the data is located.
6. After Google published technical papers detailing its Google File System (GFS) and
MapReduce programming framework in 2003 and 2004, respectively, Cutting and Cafarella
modified their earlier technology plans and developed a Java-based MapReduce implementation
and a file system modeled on Google’s, which became the Hadoop framework.
7. Hadoop is an Apache top-level project being built and used by a global community
of contributors and users. It is licensed under the Apache License 2.0.
Features (or) Benefits of Hadoop
1. It is completely open source and written in Java
2. Highly scalable: A Hadoop cluster is scalable, meaning we can add any number of nodes
(horizontal scaling) or increase the hardware capacity of individual nodes (vertical scaling) to
achieve high computation power. This provides horizontal as well as vertical scalability to
the Hadoop framework.
3. Computing power: It uses a distributed computing framework designed to provide
rapid data access across the nodes in a cluster.
4. Fault-tolerant :
– It provides fault-tolerant capabilities so applications can continue to run if individual
nodes fail.
– HDFS in Hadoop 2.0 uses a replication mechanism to provide fault tolerance. It creates
a replica of each block on the different machines depending on the replication factor (by
default, it is 3). So if any machine in a cluster goes down, data can be accessed from the
other machines containing a replica of the same data.
5. Cost effective: It doesn’t require expensive, highly reliable hardware; it runs on
clusters of commodity servers and can scale up to support thousands of hardware
nodes and massive amounts of data.
6. Faster data processing: Hadoop stores data in a distributed fashion, which
allows any kind of data (unstructured or semi-structured) to be processed
in a distributed manner on a cluster of nodes. This gives the Hadoop framework
lightning-fast processing capability.
7. High availability:
– This feature of Hadoop ensures the high availability of the data, even in
unfavorable conditions
– Due to the fault tolerance feature of Hadoop, if any of the DataNodes goes down,
the data is available to the user from different DataNodes containing a copy of the
same data.
Cont..
8. Ensures Data Reliability:
In Hadoop due to the replication of data in the cluster, data is stored reliably on the
cluster machines despite machine failures.
9. Used for Batch processing (Not for online[real time] analytical processing)
10. Flexibility:
Store any amount of any kind of data
11. Data Locality concept:
Hadoop is popularly known for its data locality feature, which means moving the
computation logic to the data rather than moving the data to the computation logic.
This feature of Hadoop reduces bandwidth utilization in the system.
Hadoop Ecosystem
Following are the components that collectively form a Hadoop ecosystem:

•HDFS: Hadoop Distributed File System


•YARN: Yet Another Resource Negotiator
•MapReduce: Programming based Data Processing
•Spark: In-Memory data processing
•PIG, HIVE: Query based processing of data services
•HBase: NoSQL Database
•Mahout, Spark MLLib: Machine Learning algorithm libraries
•Solr, Lucene: Searching and Indexing
•Zookeeper: Managing cluster
•Oozie: Job Scheduling
HDFS:

•HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby maintaining
the metadata in the form of log files.
•HDFS consists of two core components i.e.
• Name node
• Data Node
•Name Node is the prime node; it contains the metadata (data about data) and requires
comparatively fewer resources than the data nodes that store the actual data. These data nodes
are commodity hardware in the distributed environment, which undoubtedly makes Hadoop
cost-effective.
•HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
YARN:

•YARN (Yet Another Resource Negotiator), as the name implies, helps to manage the
resources across the cluster. In short, it performs scheduling and resource
allocation for the Hadoop system.
•It consists of three major components, i.e.
• Resource Manager
• Node Manager
• Application Manager
•The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas the Node Managers work on the allocation of resources such as CPU,
memory and bandwidth per machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and the Node
Managers and performs negotiations as per the requirements of the two.
MapReduce:

•By making use of distributed and parallel algorithms, MapReduce makes it possible
to carry the processing logic to the data and helps to write applications which transform big
data sets into manageable ones.
•MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
• Map() performs sorting and filtering of data and thereby organizes it into
groups. Map generates key-value pair based results which are later
processed by the Reduce() method.
• Reduce(), as the name suggests, does the summarization by aggregating the
mapped data. In simple terms, Reduce() takes the output generated by Map() as input
and combines those tuples into a smaller set of tuples.
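As an illustration of this Map()/Reduce() division of work, the following is a minimal word-count sketch written against the standard Hadoop MapReduce Java API; the class names and the command-line input/output paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emits a (word, 1) pair for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): sums the counts emitted for each word by the mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}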
PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based language
similar to SQL.
•It is a platform for structuring the data flow, processing and analyzing huge data sets.
•Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
•The Pig Latin language is specially designed for this framework and runs on the Pig Runtime,
just the way Java runs on the JVM.
•Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE:

•With the help of an SQL-like methodology and interface, HIVE performs reading and
writing of large data sets. Its query language is called HQL (Hive Query
Language).
•It is highly scalable, as it allows both real-time and batch processing.
Also, all the SQL data types are supported by Hive, making query processing
easier.
•Similar to other query-processing frameworks, HIVE comes with two
components: JDBC drivers and the HIVE command line.
•The JDBC and ODBC drivers establish the connection and data storage permissions,
whereas the HIVE command line helps in the processing of queries.
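A minimal sketch of using the JDBC driver mentioned above, assuming a HiveServer2 instance running locally on its default port 10000; the table queried is purely hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // The HiveServer2 JDBC driver shipped with Hive.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumes HiveServer2 is running locally on its default port 10000.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // 'employees' is a hypothetical table used only for illustration.
             ResultSet rs = stmt.executeQuery("SELECT name, salary FROM employees LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}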
Mahout:

•Mahout brings machine-learning capability to a system or application. Machine learning, as
the name suggests, helps a system to develop itself based on patterns,
user/environmental interaction or algorithms.
•It provides various libraries or functionalities such as collaborative filtering, clustering,
and classification which are nothing but concepts of Machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
Apache Spark:

•It’s a platform that handles all the process-intensive tasks like batch processing,
interactive or iterative real-time processing, graph conversions, visualization, etc.
•It consumes in-memory resources, hence it is faster than the previous (disk-based)
approaches in terms of optimization.
•Spark is best suited for real-time data whereas Hadoop is best suited for structured data or
batch processing, hence both are used in most of the companies interchangeably.
Other Components: Apart from all of these, there are some other components too that carry out a
huge task in order to make Hadoop capable of processing large datasets. They are as follows:

•Solr, Lucene: These are two services that perform the task of searching and indexing with
the help of Java libraries. Lucene is a Java library that also provides a spell-check
mechanism; Solr is built on top of Lucene.
•Zookeeper: There was a huge issue of management, coordination and synchronization among
the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper
overcomes these problems by providing synchronization, inter-component communication,
grouping, and maintenance.
•Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie coordinator
jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner,
whereas Oozie coordinator jobs are those that are triggered when some data or an external
stimulus is given to them.
Components of Hadoop
Hadoop Ecosystem is a platform or a suite which provides various services to solve the
big data problems. It includes Apache projects and various commercial tools and
solutions.

The Hadoop framework includes the following four modules:

 Hadoop Distributed File System (Storage Layer):
o HDFS is based on Google File System (GFS) and provides a distributed file system that is
designed to be run on commodity hardware.
o It provides high throughput access and storage to application data and is suitable for
applications having large datasets
Cont..

 Hadoop MapReduce (Processing Layer) – It is a parallel programming
model for writing distributed applications that enables efficient processing
of large amounts of data (multi-terabyte datasets)
 Hadoop YARN – This is a framework for job scheduling and cluster
resource management
 Hadoop Common – These are Java libraries and utilities required by other
Hadoop modules. These libraries provide file system and OS level
abstractions and contain the necessary Java files and scripts required to
start Hadoop
What is Distributed File System
o A distributed file system is a client/server-based application that allows clients to
access and process data stored on the server as if it were on their own computer.
o When a user accesses a file on the server, the server sends the user a copy of the file,
which is cached on the user's computer while the data is being processed and is then
returned to the server.
o DFS organizes files in a hierarchical file management system
o Distributed file systems can be advantageous because they make it easier to
distribute documents to multiple clients and they provide a centralized storage
system so that client machines are not using their resources to store files.

Benefits:
 Resource Management and Accessibility
 Fault Tolerance
 Workload Management
Difference between Distributed System and Centralized
System
Google File System(GFS) Introduction
• Google File System is a proprietary distributed file system developed by Google Inc. for
its own use.
• It is designed to provide efficient, reliable data access using large clusters of commodity
hardware.
• GFS is made up of several storage systems built from low-cost commodity hardware
components
• GFS was implemented especially for meeting the rapidly growing demands of
Google’s data processing needs.
• A new version of the Google File System, code-named Colossus, was released
in 2010.
What is GFS
• Google File System is essentially a distributed file storage system.
• In any given Google File System cluster, there can be hundreds or thousands of commodity servers.
• This cluster provides an interface for any number of clients to read from or write to a file.
Design Considerations

1. Commodity Hardware:
Google was still a young company. Instead of buying expensive servers, they
chose off-the-shelf commodity hardware, firstly because it is cheap, and secondly
because with a lot of such servers they could scale horizontally, given the right
software layer created on top of them.
2. Large Files:
 The second design consideration is that the Google File System is optimized to store
and read large files.
 A typical file in GFS ranges from 100 MB to multiple GB.
3. File Operations:
The third design consideration of GFS concerns two kinds of file
operations:
• Writes to a file (generally appends only), with no random writes within the file
• Sequential reads of the file
4. Chunks
– A single file is not stored on a single server. It is subdivided into multiple
chunks, and each chunk is 64 MB.
– These chunks can be spread across multiple machines (chunkservers), and these
chunks are identified globally by assigning a unique 64-bit ID to each.

5. Single master for a multi-TB cluster
– In the architecture of GFS, we have only one master server for an entire cluster.
All the operations (reading/writing the contents of files) performed by any
number of clients are monitored and controlled by this single master itself.
Google File System(GFS) Architecture
Architecture
• A GFS cluster consists of multiple nodes. These nodes are divided into two types:
• a single GFS master
• a large number of GFS chunkservers
• Each GFS file is divided into fixed-size chunks (64 MB).
• Chunkservers store these chunks as Linux files on local disks.
• Each chunk is assigned a unique 64-bit label by the master node at the time of
creation, to maintain the logical mapping of files to their constituent chunks.
• For reliability, each chunk is replicated several times throughout the network, with
the minimum being 3 (the default replication factor), and even more for files that are in
high demand.
• Files are stored in hierarchical directories identified by path names.
GFS Master
• The master server acts as the coordinator for the cluster.
• The master's duties include:
– maintaining an operation log, which keeps track of the activities of the master's cluster.
The operation log helps keep service interruptions to a minimum -- if the master server
crashes, a replacement server that has monitored the operation log can take its place.
– The master server also keeps track of metadata, which is the information that describes
chunks. The metadata tells the master server to which files the chunks belong and where
they fit within the overall file.
– Upon startup, the master polls all the chunkservers in its cluster. The chunkservers
respond by telling the master server the contents of their inventories. From that moment
on, the master server keeps track of the location of chunks within the cluster
Cont..
• The master server does not usually store the actual chunks, but rather all the metadata
associated with the chunks, such as the namespace, access control data, and the tables mapping
64-bit labels to chunk locations.
• All this metadata kept by the master server is updated by receiving updates from the
chunkservers (heartbeat messaging).
GFS Client
• GFS client code linked into each application implements the file system API and
communicates with the chunk servers to read or write data on behalf of the
application.
• Clients interact with the master for metadata operations, but all data-bearing
communication goes directly to the chunkservers.

GFS ChunkServers
• Chunkservers are the workhorses of the GFS. They're responsible for storing the 64-
MB file chunks. The chunkServers don't send chunks to the master server. Instead,
they send requested chunks directly to the client.
• Neither the client nor the chunkserver caches file data. Chunkservers do not need to
cache file data because chunks are stored as local files, so Linux’s buffer cache already
keeps frequently accessed data in memory.
• If a chunkserver goes down, the master ensures that all chunks that were on it are copied to
other servers.
• This ensures that replica counts remain the same.
GFS Features Include:
• Fault tolerance
• Critical data replication
• Automatic and efficient data recovery
• High aggregate throughput
• Reduced client and master interaction because of the large chunk size
• Namespace management and locking
• High availability
Hadoop Distributed File System(HDFS)
• HDFS is the primary storage system used by Hadoop applications
• It is designed for storing very large files with streaming data access patterns, running
on cluster of commodity hardware
• HDFS is a distributed, scalable and portable file system written in java for the Hadoop
Framework
• HDFS stores files as blocks distributed across the cluster, typically 128 MB each (the default
block size for Hadoop 2.0). For example, a 1 GB file is split into eight 128 MB blocks, and
with the default replication factor of 3 a total of 24 block replicas are stored in the cluster.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes
throughout a cluster to enable reliable, extremely rapid computations.
Features of HDFS:
 It contains a master-slave architecture
 Hadoop provides a command interface to interact with HDFS
 It provides a Single Namespace for entire cluster
 HDFS is optimized for throughput over latency
 It is very efficient at streaming read requests for large files but poor at
seek requests for many small files
 The built-in servers of NameNode and DataNode help users to easily
check the status of the cluster
Difference between Google File System and Hadoop
Distributed File System
Cont..
HDFS Architecture
The File System Namespace
• HDFS supports a traditional hierarchical file organization. A user or an
application can create directories and store files inside these directories.
• The file system namespace hierarchy is similar to most other existing file
systems; one can create and remove files, move a file from one directory to
another, or rename a file. HDFS does not support hard links or soft links.
• The NameNode maintains the file system namespace.
• Any change to the file system namespace or its properties is recorded by the
NameNode. An application can specify the number of replicas of a file that
should be maintained by HDFS.
• The number of copies of a file is called the replication factor of that file. This
information is stored by the NameNode.
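The namespace operations described above (creating directories, renaming and deleting files) are also exposed to client programs through the Hadoop FileSystem API; a minimal sketch, using hypothetical paths, is shown below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from the cluster configuration files.
        FileSystem fs = FileSystem.get(new Configuration());

        // Create a directory in the namespace (paths here are hypothetical).
        fs.mkdirs(new Path("/user/demo/reports"));

        // Rename (move) a file within the namespace.
        fs.rename(new Path("/user/demo/reports/jan.txt"),
                  new Path("/user/demo/reports/january.txt"));

        // Delete a path; the second argument enables recursive deletion for directories.
        fs.delete(new Path("/user/demo/old-reports"), true);

        fs.close();
    }
}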
Replication in HDFS
• HDFS provides a reliable way to store huge data in a distributed environment as data
blocks. The blocks are also replicated to provide fault tolerance.
• The replication factor represents the number of copies of a block that must be present in the
cluster.
• This value is 3 by default (comprising one original block and 2 replicas). So, every
file we create in HDFS will have a replication factor of 3 unless otherwise specified.
• The hdfs-site.xml configuration file is used to control the default HDFS replication factor:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>BlockReplication</description>
</property>
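Besides the cluster-wide default configured above, the replication factor can also be specified per file through the FileSystem API; a minimal sketch, with a hypothetical path, is shown below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/demo/important.log"); // hypothetical path

        // Create the file with a replication factor of 2 instead of the configured default.
        FSDataOutputStream out = fs.create(path, (short) 2);
        out.writeBytes("sample content\n");
        out.close();

        // The replication factor of an existing file can also be changed later.
        fs.setReplication(path, (short) 3);

        fs.close();
    }
}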
Building Blocks or Daemons of Hadoop
 On a fully configured cluster, “running Hadoop” means running a set of
daemons, or resident programs, on the different servers in our network. These
daemons have specific roles; some exist only on one server, some exist across
multiple servers.
 The daemons includes:
 NameNode (Master Node)
 DataNode (Slave Node)
 Secondary NameNode (Check-point node)
 Job Tracker
 Task Tracker

Note: A daemon is a computer program that runs as a background process rather than
being under the direct control of an interactive user.
NameNode
• A Hadoop cluster consists of a single NameNode
• The Namenode is the main central component of HDFS architecture framework, that
directs the slave DataNodes to perform low level I/O tasks
• NameNode doesn't store any user data (actual data) or perform any computation
• It is the bookkeeper of HDFS, i.e. it keeps track of file metadata: how the files are
broken down into blocks, which nodes store those blocks, and the overall health of the
distributed file system
• NameNode is a very highly available server that manages the File System
Namespace and controls access to files by clients.
Functions of Namenode

• It is the master daemon that maintains and manages the DataNodes (slave nodes)
and assigns tasks to them
• It records the metadata of all the files stored in the cluster, e.g. The location of blocks
stored, the size of the files, permissions, hierarchy, etc. There are two files associated
with the metadata:
• FsImage: It contains the complete state of the file system namespace since the
start of the NameNode.
• EditLogs: It contains all the recent modifications made to the file system with
respect to the most recent FsImage
• It records each change that takes place to the file system metadata. For example, if a
file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
Cont..
• It regularly receives heartbeat signals and block reports from all the DataNodes in
the cluster to ensure that the DataNodes are alive.
• The Namenode executes file system operations like opening, closing and renaming
the files and directories
• It keeps a record of all the blocks in HDFS and in which nodes these blocks are
located
• The NameNode is also responsible to take care of the replication factor of all the
blocks.
Cont..
• In case of a DataNode failure, the NameNode chooses new DataNodes for
new replicas, balances disk usage and manages the communication traffic to the
DataNodes.
• Unfortunately, the NameNode has a negative aspect: it is a single point of
failure in the Hadoop cluster.
• For any of the other daemons, if their host nodes fail for software or hardware
reasons, the Hadoop cluster will likely continue to function smoothly or you
can quickly restart it. But, when the Namenode is down, HDFS/Hadoop
Cluster is inaccessible and considered down.
DataNode

• Each slave node in the Hadoop Cluster is a DataNode
• It is responsible for storing and managing the actual data on the slave node.
• A functional filesystem has more than one DataNode, with data replicated across
them
• The client writes data to one slave node, and then it is the responsibility of the DataNode to
replicate the data to other slave nodes according to the replication factor.
Functionalities of DataNode
• These are slave daemons or process which runs on each slave machine.
• The actual data is stored on DataNodes
• The DataNodes perform the low-level read and write requests from the file system’s
clients
• When the client wants to read or write an HDFS file, the file is broken into blocks and
the NameNode tells the client on which DataNodes the blocks reside, so that the
client can communicate directly with the DataNodes to process the local data.
• A DataNode may further communicate with other DataNodes to replicate its data
blocks for redundancy.
• By default, 3 replicas are maintained for each block, placed on three different
nodes. This ensures that if any node crashes, you are still able to read the file.
Interaction between NameNode and DataNode
Cont..
• Upon initialization, each DataNode informs the NameNode which blocks
it currently stores. After this mapping is completed, the DataNodes continually poll the
NameNode to provide information regarding local changes.
• During Normal operation, DataNodes send heartbeats to the NameNode to
conform that DataNodes are functioning properly and operating in a controlled
environment.
• The default heartbeat interval is three seconds. If the NameNode does not receive a
heartbeat from a DataNode for ten minutes, the NameNode considers that
DataNode to be out of service. The NameNode then schedules the creation of new
replicas of that DataNode’s blocks on other nodes.
• The NameNode keeps track of the file metadata: which files are in the system and
how each file is broken down into blocks. The DataNodes provide the actual storage of
the blocks and constantly report to the NameNode to keep the metadata
up to date.
DataNode Sends Heartbeats to NameNode
Secondary NameNode (SNN)
• Apart from these two daemons, there is a third daemon or a process called Secondary
NameNode. The Secondary NameNode works concurrently with the primary
NameNode as a helper daemon.
• The Secondary NameNode (SNN) is an assistant(helper) daemon for monitoring the
state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it
typically resides on its own machine as well
• The secondary Namenode daemon is responsible for performing periodic
housekeeping functions for Namenode
• It only creates checkpoints of the filesystem metadata and it is not be a back-up
for Namenode
Cont..
• The main function of the Secondary NameNode is to store the latest copy of the
FsImage and the Edits Log files.
• Hence it is also called Checkpoint node
• A checkpoint is nothing but the update of the latest FsImage file by applying the
latest Edits Log files to it.
• If the time gap between checkpoints is large, there will be too many Edits Log files
generated, and it will be very cumbersome and time consuming to apply them all at
once to the latest FsImage file. This may lead to a long start-up time for the primary
NameNode after a reboot.
• However, the Secondary NameNode is just a helper to the primary NameNode in an
HDFS cluster, as it cannot perform all the functions of the primary NameNode.
The figure below shows the working of the Secondary NameNode:
1. It gets the edit logs from the NameNode at regular intervals and applies them to its copy of the fsimage
2. Once it has a new fsimage, it copies it back to the NameNode
3. The NameNode will use this fsimage for the next restart, which will reduce the startup time
JoB Tracker
• A Hadoop cluster is a collection of computers, known as nodes, that are networked together
to perform these kinds of parallel computations on big data sets. Unlike other computer
clusters, Hadoop clusters are designed specifically to store and analyze mass amounts of
structured and unstructured data in a distributed computing environment.
• Job Tracker runs on a server as a master node of the cluster
• The job Tracker is a communicator between client application and Hadoop
• There will be only one job Tracker daemon per Hadoop Cluster
• Once code is submitted, the job tracker determines the execution plan by determining:
• Which files to process
• Assign nodes to different tasks
• Monitor all running tasks
• If any task fails, the job tracker automatically relaunches the task on a different node
Role of Job Tracker
1. Client applications submit jobs to the Job tracker.
2. The JobTracker talks to the NameNode to determine the location of the data
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough,
they are deemed to have failed and the work is scheduled on a different TaskTracker
6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to
do then: it may resubmit the job elsewhere, it may mark that specific record as something to
avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.
9. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all
running jobs are halted.
Job Tracker and Task Tracker Interaction
• After a client calls the JobTracker to begin a data processing job, the JobTracker
partitions the work and assigns different map and reduce tasks to each TaskTracker in
the cluster.
Task Tracker
• A TaskTracker runs on a DataNode in the cluster and accepts tasks: Map, Reduce and
Shuffle operations
• Each Task Tracker is responsible for executing the individual tasks on each slave
node that the Job Tracker assigns.
• Every TaskTracker is configured with a set of slots, these indicate the number of
tasks that it can accept. When the JobTracker tries to find somewhere to schedule a
task within the MapReduce operations, it first looks for an empty slot on the same
server that hosts the DataNode containing the data, and if not, it looks for an empty
slot on a machine in the same rack.
• Although there is a single Task Tracker per slave node, each Task Tracker can run
multiple JVMs to handle many map or reduce tasks in parallel.
Cont..

• One responsibility of the Task Tracker is to constantly communicate with
the Job Tracker to send the status of the jobs.
• If the Job Tracker fails to receive a heartbeat from a Task Tracker within
a specified amount of time, it will assume the Task Tracker has crashed
and will resubmit the corresponding tasks to other nodes in the cluster.
File Read Operation
Anatomy of a File Read (or) How is data read from HDFS
• Step 1: The client opens the file it wishes to read by calling open() on the File
System Object(which for HDFS is an instance of Distributed File System).
• Step 2: Distributed File System( DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file. For each
block, the name node returns the addresses of the data nodes that have a copy of that
block. The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node
and name node I/O.
• Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the data node addresses for the first few blocks in the file, then connects to
the first (closest) data node for the first block in the file.
• Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
• Step 5: When the end of the block is reached, DFSInputStream will close the
connection to the data node and then find the best data node for the next block. This
happens transparently to the client, which from its point of view is simply reading
a continuous stream. Blocks are read in order, with the DFSInputStream opening new
connections to data nodes as the client reads through the stream. It will also
call the name node to retrieve the data node locations for the next batch of blocks
as needed.
• Step 6: When the client has finished reading the file, it calls close()
on the FSDataInputStream.
A sample code to read a file from HDFS in Java is as follows:
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/to/file.ext");
if (!fileSystem.exists(path)) {
System.out.println("File does not exist");
return;
}
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024]; // buffer for the data being read
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
System.out.write(b, 0, numBytes); // code to manipulate the data which is read
}
in.close();
fileSystem.close();
Anatomy of a File Write (or) How is data written in HDFS
• Step 1: The client creates the file by calling create() on DistributedFileSystem(DFS).
• Step 2: DFS makes an RPC call to the name node to create a new file in the file
system’s namespace, with no blocks associated with it. The name node performs
various checks to make sure the file doesn’t already exist and that the client has the
right permissions to create the file. If these checks pass, the name node prepares a
record of the new file; otherwise, the file can’t be created and the client is
thrown an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the
client to start writing data to.
• Step 3: As the client writes data, the DFSOutputStream splits it into packets,
which it writes to an internal queue called the data queue. The data queue is consumed
by the DataStreamer, which is responsible for asking the name node to allocate new blocks
by picking a list of suitable data nodes to store the replicas. The list of data
nodes forms a pipeline, and here we’ll assume the replication level is three, so there
are three nodes in the pipeline. The DataStreamer streams the packets to the first
data node in the pipeline, which stores each packet and forwards it to the second
data node in the pipeline.
• Step 4: Similarly, the second data node stores the packet and forwards it to the
third (and last) data node in the pipeline.
• Step 5: The DFSOutputStream maintains an internal queue of packets that are
waiting to be acknowledged by data nodes, called the “ack queue”.
• Step 6: When the client has finished writing data, it calls close() on the stream. This
flushes all the remaining packets to the data node pipeline and waits for
acknowledgments before contacting the name node to signal that the file is complete.
• HDFS follows the Write Once Read Many model. So, we can’t edit files that are
already stored in HDFS, but we can append to them by reopening the file. This
design allows HDFS to scale to a large number of concurrent clients because
the data traffic is spread across all the data nodes in the cluster. Thus, it
increases the availability, scalability, and throughput of the system.
A sample code to write a file to HDFS in Java is as follows:
FileSystem fileSystem = FileSystem.get(conf);
// Check if the file already exists
Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
System.out.println("File " + path + " already exists");
return;
}
// Create a new file and write data to it.
FSDataOutputStream out = fileSystem.create(path);
// 'source' is the path of the local file to be copied into HDFS
InputStream in = new BufferedInputStream(new FileInputStream(
new File(source)));
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
out.write(b, 0, numBytes);
}
// Close all the file descriptors
in.close();
out.close();
fileSystem.close();
Introducing and Configuring Hadoop cluster
• Hadoop is supported by GNU/Linux platform and its flavors.
• Before installing Hadoop into the Linux environment, we need to set up Linux using
ssh (Secure Shell).
Creating a User:
 At the beginning, it is recommended to create a separate user for Hadoop to isolate
Hadoop file system from Unix file system. Follow the steps given below to create a
user:
 Open the root account using the command “su”.
 Create a user from the root account using the command “useradd username”.
 Set a password for the new user using the command “passwd username”.
 Now you can switch to the new user account using the command “su username”.
 Open the Linux terminal and type the following commands to create the user.
• $ su
• password:
• # useradd hadoop
• # passwd hadoop
• New passwd: ******
• Retype new passwd : ******
SSH Setup and Key Generation
• SSH setup is required to do different operations on a cluster such as starting,
stopping, distributed daemon shell operations
• To authenticate different users of Hadoop, it is required to provide public/private
key pair for a Hadoop user and share it with different users.
• The following commands are used for generating a key pair using SSH, copying the
public key from id_rsa.pub to authorized_keys, and providing the owner
with read and write permissions on the authorized_keys file respectively.
o $ ssh-keygen -t rsa
o $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
o $ chmod 0600 ~/.ssh/authorized_keys
Installing Java
• Java is the main prerequisite for Hadoop.
• First of all, you should verify the existence of java in your system using the command
“java -version”.
• The syntax of java version command is given below.
$ java –version
• If everything is in order, it will give you output similar to the following:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
• For setting up PATH and JAVA_HOME variables, add the
following commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
• If java is not installed in your system, then follow the steps given below for
installing java.
• Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link:
• http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
• Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.
• Generally you will find the downloaded java file in Downloads folder. Verify it and
extract the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Downloading Hadoop
• Download and extract Hadoop 2.4.1 from Apache software foundation using the
following commands
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit
Hadoop Operational Modes

• Hadoop is a framework specially designed for the distributed batch processing and
storage of enormous datasets on commodity hardware.
• Hadoop being an open source framework can be utilized on a single machine or even
a cluster of machines
• Hadoop is very much efficient in the distribution of the huge datasets on commodity
hardware
• Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of
the three supported modes:
 Standalone mode (single node cluster)
 Pseudo distributed mode (single node cluster)
 Fully distributed mode (multi node cluster)
Standalone(Local) mode
• The standalone mode is the default mode in which Hadoop run
• With empty configuration files, Hadoop will run completely on a single machine
(non distributed mode as a single java Process)
• Because there’s no need to communicate with other nodes, the standalone mode
doesn’t use HDFS, nor will it launch any of the Hadoop daemons; everything
runs in a single JVM instance
• Its primary use is for developing and debugging the application logic of a
MapReduce program without the additional complexity of interacting with the
daemons
• Standalone mode is usually the fastest Hadoop mode, as it uses the local file
system for all the input and output
• When Hadoop works in this mode there is no need to configure the files – hdfs-
site.xml, mapred-site.xml, core-site.xml for Hadoop environment. In this Mode, all
of your Processes will run on a single JVM(Java Virtual Machine) and this mode can
only be used for small development purposes.
Pseudo Distributed mode
• The pseudo-distribute mode is also known as a single-node cluster where both
NameNode and DataNode will reside on the same machine.
• In pseudo-distributed mode, all the Hadoop daemons will be running on a single
node. Such configuration is mainly used while testing when we don’t need to think
about the resources and other users sharing the resource.
• In this architecture, a separate JVM is spawned for every Hadoop component,
and they communicate across network sockets, effectively producing a fully
functioning and optimized mini-cluster on a single host
• It uses HDFS for storage, and YARN is also used for managing the resources in the
Hadoop installation
• The replication factor for blocks will be ONE
• Changes will be required in all three configuration files: mapred-site.xml,
core-site.xml and hdfs-site.xml
Fully distributed mode
• This is the production mode of Hadoop where multiple nodes will be running. Here data
will be distributed across several nodes and processing will be done on each node.
• Master and Slave services will be running on the separate nodes in fully distributed Hadoop
Mode.
• The following are the three kinds of servers used to set up a full cluster:
– the master node of the cluster, which hosts the NameNode and JobTracker
daemons
– the server that hosts the Secondary NameNode daemon
– the slave boxes of the cluster (slave1, slave2, slave3, …), each running both the
DataNode and TaskTracker daemons
Multiple nodes are used to operate Hadoop in Fully Distributed Mode
Hadoop Operational Modes-Summary
Configuring XML files

The following files are the important configuration files for the runtime environment
settings of a Hadoop cluster.
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml
Core-site.xml:
• The core-site.xml file informs Hadoop where NameNode runs in the cluster. It
contains configuration settings for Hadoop code such as I/O settings that are
common to HDFS and MapReduce
• The core-site.xml file contains information such as the port number used for
Hadoop instance, memory allocated for the file system, memory limit for storing
the data, and size of Read/Write buffers
Location of the file: /etc/Hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name </name>
<value> hdfs://localhost:9000</value>
</property>
</configuration>
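The fs.default.name value above (fs.defaultFS in newer Hadoop releases) is what client programs use to locate the NameNode. A minimal sketch of how a Java client picks it up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConnectToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Normally this value is read from core-site.xml on the classpath;
        // it can also be set explicitly, as shown here for illustration.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}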
Hdfs-site.xml

• The hdfs-site.xml file contains the configuration settings for HDFS daemons; the
NameNode, the Secondary NameNode, and the DataNodes.
• This xml file also provides paths of NameNode and DataNode
• Here, we can configure hdfs-site.xml to specify default block replication and
permission checking on HDFS.
• The actual number of replicas can also be specified when a file is created. The
default is used if the replication is not specified at create time.
<configuration>
<!--To configure/specify Replica-->
<property>
<name>dfs.replication</name>
<value>1</value>
</Property>
<!--To configure/specify NameNode Metadata location-->
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode </value>
</property>
<!--To configure/specify the DataNode data storage location-->
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
Mapred-site.xml
• It is one of the important configuration files which is required for runtime
environment settings of a Hadoop.
• This file contains the configuration settings for MapReduce daemons; the job tracker
and the task-trackers.
• The mapred.job.tracker parameter is a hostname (or IP address) and port pair on
which the Job Tracker listens for RPC communication. This parameter specifies the
location of the Job Tracker to the Task Trackers and MapReduce clients.
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Yarn-site.xml:
• This file is used to configure yarn into Hadoop.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Verifying Hadoop Installation
Step 1: Name Node Setup
Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
Step 2: Verifying Hadoop dfs (start the master daemons)
The following command is used to start dfs. Executing this command
will start your Hadoop file system.
$ start-dfs.sh
Step 3: Start the MapReduce daemons using the following command:
$ start-mapred.sh
Step 4: Once started, check the status on the master and slaves by using the jps (Java
Process Status) command:
$ jps
We will get the below Output:
14799 NameNode
15314 Jps
16977 secondaryNameNode
15183 DataNode
Working with files in HDFS

• A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into
HDFS using one of the command line utilities.
• HDFS is not a native UNIX file system. So, standard UNIX file tools such as ls and
cp don’t work on it, and neither do standard file read and write operations such as
fopen() and fread().
• Hadoop provides a set of command line utilities that work similarly to the Linux file
commands. After copying the files into HDFS, MapReduce programs process this data.
But they don’t read the HDFS files directly. Instead they rely on the MapReduce framework
to read and parse the HDFS files into individual records (key-value pairs), which are
the unit of data that MapReduce programs work on.
Basic File Commands

• Hadoop file commands take the form
hadoop fs -cmd <args>
where cmd is the specific file command and <args> is a variable number of
arguments.
Hadoop-HDFS Operations
1. Starting HDFS
Initially you have to format the configured HDFS file system, open namenode (HDFS
server), and execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system.
The following command will start the namenode as well as the data nodes as cluster.
$ start-dfs.sh

2. Listing Files in HDFS


After loading the information on the server, we can find the list of files in a directory, or the
status of a file, using ‘ls’. Given below is the syntax of ls; you can pass it a directory
or a filename as an argument.
hadoop fs -ls <args>
3. Creating Directories:
• Before running Hadoop programs on data stored in HDFS, we need to put the data
in HDFS first. HDFS has a default working directory of /user.
• To create directory we use mkdir command.
hadoop fs -mkdir /user/555
4. Storing files in HDFS:
• One can put a file into the current HDFS working directory using fs -put.
hadoop fs -put example.txt    //puts the file from the local file system into HDFS
• After putting the files (data) in HDFS, we can run a Hadoop program to process them. The
output of the processing will be a new set of files in HDFS.
Example: hadoop fs -put ArrayListDemo.java /555
• This copies ArrayListDemo.java from the local file system into HDFS under /555 (the
target directory should already exist, e.g., created with mkdir).
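• The destination can also be given explicitly, and copyFromLocal behaves like put but accepts only local sources (the paths shown are illustrative):
hadoop fs -put example.txt /user/hadoop/input/
hadoop fs -copyFromLocal example.txt /user/hadoop/input/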
5. Retrieve files from HDFS:
• The Hadoop get command is the reverse of put: it copies files from HDFS to the local file system.
hadoop fs -get /555/ArrayListDemo.java
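• A local destination directory can also be given explicitly, and copyToLocal behaves like get (the paths shown are illustrative):
hadoop fs -get /555/ArrayListDemo.java /home/hadoop/downloads/
hadoop fs -copyToLocal /555/ArrayListDemo.java /home/hadoop/downloads/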
6. Display the content of HDFS files:
• The Hadoop cat command allows us to display the contents of an HDFS file.
hadoop fs -cat /555/ArrayListDemo.java
7. Deleting files:
• The rm command is used to remove files and empty directories.
hadoop fs -rm /555/ArrayListDemo.java
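• To remove a directory together with its contents, the -r (recursive) option is used in Hadoop 2.x and later (older releases used -rmr); an illustrative example:
hadoop fs -rm -r /555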
8. Looking up help:
• We can use hadoop fs (with no parameters) to get a complete list of all commands available
in hadoop fs. We can also use -help to display the usage and a short description
of each command.
hadoop fs -help ls
9. Shutting Down the HDFS
• You can shut down the HDFS by using the following command.
$ stop-dfs.sh
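• If the MapReduce or YARN daemons were started separately, they are stopped with the matching script; a sketch, assuming the corresponding start script was used earlier:
$ stop-mapred.sh    (Hadoop 1.x: stops the JobTracker and TaskTrackers)
$ stop-yarn.sh      (Hadoop 2.x and later: stops the ResourceManager and NodeManagers)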
HDFS Goals
• Detection of faults and automatic recovery
• High throughput of data access rather than low latency
• Provide high aggregate data bandwidth and scale to hundreds of nodes in a single
cluster
• Write-once-read-many access model for files
• Applications move themselves closer to where the data is located
• Easily Portable
Advantages of Hadoop
• Varied Data Sources
Hadoop accepts a variety of data. Data can come from a range of sources such as email
conversations, social media, etc., and can be in structured or unstructured form. Hadoop can
derive value from diverse data and can accept data as text files, XML files, images,
CSV files, etc.
• Cost-effective
Hadoop is an economical solution as it uses a cluster of commodity hardware to store data.
Commodity hardware consists of cheap machines, so the cost of adding nodes to the framework is
not very high.
• Performance
The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in turn,
utilizes the underlying parallelism of the CPU cores.
• Fault-Tolerant
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the application
layer.
• Highly Available
In Hadoop 2.0, the HDFS architecture has a single active NameNode and a single standby
NameNode, so if a NameNode goes down we have a standby NameNode to count on.
Hadoop 3.0 supports multiple standby NameNodes, making the system even more highly
available as it can continue functioning even if two or more NameNodes crash.
• Low Network Traffic
In Hadoop, each job submitted by the user is split into a number of independent sub-tasks
and these sub-tasks are assigned to the data nodes thereby moving a small amount of code to
data rather than moving huge data to code which leads to low network traffic.
• High Throughput
Throughput means the amount of work done per unit time. Hadoop stores data in a distributed fashion,
which allows distributed processing to be used with ease. A given job gets divided into small jobs
that work on chunks of data in parallel, thereby giving high throughput.
• Open Source
Hadoop is an open source technology i.e. its source code is freely available. We can modify
the source code to suit a specific requirement.
• Scalable
Hadoop works on the principle of horizontal scalability, i.e., we add entire
machines to the cluster of nodes rather than changing the configuration of a machine by
adding RAM, disk and so on, which is known as vertical scalability. Nodes can be added
to a Hadoop cluster on the fly, making it a scalable framework.
• Ease of use
The Hadoop framework takes care of parallel processing; MapReduce programmers
do not need to worry about achieving distributed processing, as it is done at the backend
automatically.
• Compatibility
Most of the emerging Big Data technologies are compatible with Hadoop, like Spark,
Flink, etc. Their processing engines work over Hadoop as a backend, i.e.,
Hadoop is used as the data storage platform for them.
• Multiple Languages Supported
Developers can code using many languages on Hadoop like C, C++, Perl, Python,
Ruby, and Groovy.
Disadvantages of Hadoop
• Supports Only Batch Processing
At the core, Hadoop has a batch processing engine which is not efficient in stream
processing. It cannot produce output in real-time with low latency. It only works on
data which we collect and store in a file in advance before processing.
• Iterative Processing
Hadoop cannot do iterative processing by itself. Machine learning and other iterative
processing have a cyclic data flow, whereas Hadoop has data flowing in a chain of
stages where the output of one stage becomes the input of the next.
• Lack of Preventive Measures
When handling sensitive data collected by a company, it is mandatory to provide the
necessary security measures. In Hadoop, the security measures are disabled by
default. The person responsible for data analytics should be aware of this fact and
take the required measures to secure the data.
• Not Fit for Small Data
Hadoop is suitable for a small number of large files, but when it comes to applications
that deal with a large number of small files, Hadoop struggles. A
small file is a file significantly smaller than Hadoop’s block size (128 MB by
default, often configured to 256 MB). A large number of small
files overloads the NameNode, which stores the namespace for the whole file system, and makes it
difficult for Hadoop to function.
• Vulnerable By Nature
Hadoop is written in Java, which is a widely used programming language, hence it
is more easily exploited by cyber criminals, which makes Hadoop vulnerable to security
breaches.
• Processing Overhead
In Hadoop, data is read from the disk and written to the disk, which makes
read/write operations very expensive when we are dealing with terabytes and petabytes of
data. Hadoop cannot do in-memory calculations, hence it incurs processing overhead.
• Potential Stability Issues
Hadoop is an open source platform. That essentially means it is created by the
contributions of the many developers who continue to work on the project. While
improvements are constantly being made, like all open source software, Hadoop has
had its fair share of stability issues.
To avoid these issues, organizations are strongly recommended to make sure they are
running the latest stable version, or run it under a third-party vendor equipped to
handle such problems.
Real time Industry Applications of Hadoop
• Security and Law Enforcement
The National Security Agency of the USA uses Hadoop to help prevent terrorist attacks and to
detect and prevent cyber-attacks. Police forces use big data tools to catch criminals and even
predict criminal activity, and credit card companies use big data to detect fraudulent
transactions. Hadoop holds big amounts of data from which meaningful
information is extracted; financial companies use these techniques to understand their
customers. Public sector fields such as intelligence, defense, cyber security and scientific
research use Hadoop to identify fraudulent users.
• Managing Traffics on Road
Hadoop is used in the development of countries, states and cities by analyzing data; for example,
traffic jams can be controlled with the help of Hadoop. It is used in the development of smart cities
and to improve city transport, giving proper guidelines for buses, trains and
other modes of transportation.
• Improving business performance by analyzing customer data in real time
One of Hadoop’s most important uses is in understanding customer requirements. Many
companies, such as financial and telecom firms, use this technology to find out customer
requirements by analyzing big amounts of data and extracting the important information
from them. Social media platforms also use this technology, posting
advertisements on various social media sites targeted at customers whenever they
open social media in a browser. Credit card companies use this technology to
find the right customers for their products and contact them through
various channels.
• Improving Sports
Hadoop is also used in the sports field. IBM SlamTracker is a tool used in
tennis, and video analytics is used in football and baseball to improve the
performance of every player; these days, many sensors are also used to improve the
performance of the games.
• Improving Healthcare and Public Health
Hadoop is also used in the medical field to improve public health. Many health-
related applications are based on Hadoop; they monitor day-to-day activities, and
from the huge amount of public data gathered this way they deduce facts which
can be used in medicine to improve the health of the country.
• Financial Trading and Forecasting
Hadoop is used in the trading field. Complex algorithms scan markets
with predefined conditions and criteria to find trading opportunities, and they are designed
to work without human interaction when nobody is present to
monitor things, according to end-user needs. Hadoop is also used in high-
frequency trading, where many trading decisions are taken by algorithms alone.
• Optimizing Machine Performance
Hadoop is also used in the mechanical field, for example in developing self-driving cars
through automation. By providing GPS, cameras and powerful sensors, this helps
run a car without a human driver; Hadoop plays a very big role in
this field, which is going to change in the coming days.
• Improving Science and Research
Hadoop also plays a very important role in science and research.
Many decisions are taken based on the extraction of huge amounts of relevant data,
which helps researchers reach conclusions easily and find results with less
effort compared to earlier times.
• Personal Quantification and Performance Optimization
Hadoop is used to improve personal life. It provides many ways to improve day-to-
day life by monitoring sleep patterns, morning walks and the whole daily routine of
healthy people; by drawing conclusions from all these patterns it helps to improve
our lives. Many dating sites currently use Hadoop to find people with
common interests, which helps people find true love; this is one of the big fields in applications
of Hadoop. It also helps to improve our health by giving guidelines and suggestions.
• Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster
Hadoop Cluster
• A Hadoop cluster is a collection of computers, known as nodes, that are networked
together to perform parallel computations on big data sets.
• Unlike other computer clusters, Hadoop clusters are designed specifically to store
and analyze mass amounts of structured and unstructured data in a distributed
computing environment.
• Hadoop clusters consist of a network of connected master and slave nodes that utilize
high availability, low-cost commodity hardware.
• The ability to linearly scale and quickly add or subtract nodes as volume demands
makes them well-suited to big data analytics jobs with data sets highly variable in
size.
Hadoop Cluster Architecture
Cont..
• Hadoop clusters are composed of a network of master and worker nodes that
orchestrate and execute the various jobs across the Hadoop distributed file system.
• The master nodes typically utilize higher quality hardware and include a
NameNode, Secondary NameNode, and JobTracker, with each running on a separate
machine.
• The workers consist of virtual machines, running both DataNode and TaskTracker
services on commodity hardware, and do the actual work of storing and processing
the jobs as directed by the master nodes.
• The final part of the system is the Client Nodes, which are responsible for loading
the data and fetching the results.
Advantages of a Hadoop Cluster
• Hadoop clusters can boost the processing speed of many big data analytics
jobs, given their ability to break down large computational tasks into smaller
tasks that can be run in a parallel, distributed fashion.
• Hadoop clusters are easily scalable and can quickly add nodes to increase
throughput, and maintain processing speed, when faced with increasing data
blocks.
• The use of low cost, high availability commodity hardware makes Hadoop
clusters relatively easy and inexpensive to set up and maintain.
• Hadoop clusters replicate a data set across the distributed file system, making
them resilient to data loss and cluster failure.
• Hadoop clusters make it possible to integrate and leverage data from multiple
different source systems and data formats.
• It is possible to deploy Hadoop using a single-node installation, for evaluation
purposes.
Tutorial Questions
• Discuss in brief about the building blocks of Hadoop
(or)
Explain the following the terms:
a) Namenode
b) DataNode
c) Secondary Namenode
d) JobTracker
e) Tasktracker.
• Differentiate between HDFS and GFS. Explain in brief the architecture of GFS.
• Discuss in brief about the operational modes in Hadoop cluster configuration.
(or)
What are the different modes in which Hadoop can be installed, and what is the use of each
mode from an application and developer point of view?
Tutorial Questions (Cont..)
• Explain the uses of Name node, Data node and Secondary Name node in Hadoop
Distributed File system.
• What is the replication factor in HDFS and what is its default value?
• Define Hadoop Cluster? How can you configure Hadoop cluster?
• What are the advantages and disadvantages of Hadoop?
• Define Data node? How does Name node tackle data node failures?
• Discuss in brief about the Name node, Data node, Check point name node and back
up node ?
• What are the real time industry applications of Hadoop?