Hadoop Cluster
A Hadoop cluster is a collection of computers, known as nodes, that are networked together to
perform parallel computations on big data sets. Unlike other computer clusters, Hadoop
clusters are designed specifically to store and analyze massive amounts of structured and
unstructured data in a distributed computing environment.
Hadoop clusters are further distinguished from other computer clusters by their unique
structure and architecture. A Hadoop cluster consists of a network of connected master and
slave nodes built on high-availability, low-cost commodity hardware. The ability to scale
linearly, quickly adding or removing nodes as data volume demands, makes Hadoop clusters
well suited to big data analytics jobs whose data sets vary widely in size.
Hadoop Cluster Architecture
Hadoop clusters are composed of a network of master and worker nodes that orchestrate and
execute the various jobs across the Hadoop Distributed File System (HDFS). The master nodes
typically use higher-quality hardware and include a NameNode, a Secondary NameNode, and a
JobTracker, each running on a separate machine. The worker nodes are typically virtual
machines running both the DataNode and TaskTracker services on commodity hardware, and they
do the actual work of storing the data and processing the jobs as directed by the master
nodes. The final part of the system is the client nodes, which are responsible for loading
the data and fetching the results.
Master nodes are responsible for overseeing the storage of data in HDFS and for coordinating
key operations, such as running parallel computations on the data using MapReduce.
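To make this concrete, the following is a minimal sketch of a MapReduce job, the classic word count, written in Java against the Hadoop MapReduce API (Hadoop 2.x style). The class name and the use of command-line arguments for the HDFS input and output paths are illustrative assumptions rather than details of any particular cluster. Mappers run in parallel on the worker nodes, each consuming one block of input, and reducers aggregate the partial counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count job: mappers run in parallel on worker nodes, each
// processing one block of the input; reducers aggregate the partial counts.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // sum the partial counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is packaged as a JAR and submitted from a client node (for example with the hadoop jar command); the master then schedules the individual map and reduce tasks on the worker nodes.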
The worker nodes comprise most of the virtual machines in a Hadoop cluster and perform the
job of storing the data and running computations. Each worker node runs the DataNode and
TaskTracker services, which receive instructions from the master nodes.
Client nodes are in charge of loading the data into the cluster. They first submit MapReduce
jobs describing how the data should be processed, and then fetch the results once the
processing is finished.
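As an illustration of the client node's role, the sketch below uses the HDFS Java API (FileSystem) to copy a local log file into the cluster before a job runs and to read back the output files afterwards. The NameNode address and the paths are placeholders; in a real deployment they would come from the cluster's configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientNodeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // 1. Load data into the cluster: local log file -> HDFS input directory.
            fs.copyFromLocalFile(new Path("/var/log/app/access.log"),
                                 new Path("/user/analytics/input/access.log"));

            // 2. A MapReduce job (such as the word count above) would be submitted here.

            // 3. Fetch the results once processing is finished.
            for (FileStatus status : fs.listStatus(new Path("/user/analytics/output"))) {
                if (!status.isFile()) {
                    continue;               // skip any subdirectories in the output
                }
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        fs.open(status.getPath()), StandardCharsets.UTF_8))) {
                    reader.lines().forEach(System.out::println);
                }
            }
        }
    }
}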
Advantages of a Hadoop Cluster
o Hadoop clusters can boost the processing speed of many big data analytics jobs, given their ability to
break large computational tasks into smaller tasks that can be run in a parallel, distributed
fashion.
o Hadoop clusters are easily scalable and can quickly add nodes to increase throughput and maintain
processing speed as data volumes grow.
o The use of low-cost, high-availability commodity hardware makes Hadoop clusters relatively easy and
inexpensive to set up and maintain.
o Hadoop clusters replicate each data set across the distributed file system, making them resilient to data
loss and cluster failure.
o Hadoop clusters make it possible to integrate and leverage data from multiple different source systems
and data formats.
o It is also possible to deploy Hadoop as a single-node installation, for evaluation purposes.
Configuration modes:
Hadoop can be configured to run in one of three modes: standalone (local) mode, in which
everything runs in a single JVM against the local file system; pseudo-distributed mode, in
which all of the Hadoop daemons run on a single node; and fully distributed mode, in which
the daemons are spread across a multi-node cluster.
Apache Flume
Assume an e-commerce web application wants to analyze customer behavior from a particular
region. To do so, it needs to move the available log data into Hadoop for analysis. Here,
Apache Flume comes to the rescue.
Flume is used to move the log data generated by application servers into HDFS at high speed.
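As a rough sketch of how an application server hands its log data to Flume, the snippet below uses Flume's client SDK (RpcClient) to send a single log event to a Flume agent. It assumes the agent is listening with an Avro source on the given host and port, and that its channel and HDFS sink forward events into HDFS; the host name, port number, and log line are illustrative.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeLogSender {
    public static void main(String[] args) {
        // Assumed host/port of a Flume agent running an Avro source.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
            // Wrap one application log line in a Flume event and hand it to the agent;
            // the agent's channel and sink take it the rest of the way into HDFS.
            Event event = EventBuilder.withBody(
                    "GET /cart/checkout 200 region=south", StandardCharsets.UTF_8);
            client.append(event);
        } catch (EventDeliveryException e) {
            System.err.println("Event could not be delivered: " + e.getMessage());
        } finally {
            client.close();
        }
    }
}

In practice, applications often let a Flume source (for example a spooling-directory or exec source) pick up the log files directly, with the agent's channel buffering events on their way to the HDFS sink.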
Advantages of Flume
o Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
o When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between the data producers and the centralized stores and
provides a steady flow of data between them.
o Flume provides the feature of contextual routing.
o Transactions in Flume are channel-based: two transactions (one for the sender and one for the
receiver) are maintained for each message, which guarantees reliable message delivery.
o Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of Flume
o Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase)
efficiently.
o Using Flume, we can get data from multiple servers into Hadoop in near real time.
o Along with log files, Flume is also used to import the huge volumes of event data produced by
social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and
Flipkart.
o Flume supports a large set of source and destination types.
o Flume supports multi-hop flows, fan-in and fan-out flows, contextual routing, and more.
o Flume can be scaled horizontally.