Hadoop Cluster
A Hadoop cluster is a collection of computers, known as nodes, that are networked together to
perform parallel computations on big data sets. Unlike other computer clusters, Hadoop
clusters are designed specifically to store and analyze massive amounts of structured and
unstructured data in a distributed computing environment.
Hadoop clusters are further distinguished from other computer clusters by their unique
structure and architecture. A Hadoop cluster consists of a network of connected master and
slave nodes built on high-availability, low-cost commodity hardware. The ability to scale
linearly, quickly adding or removing nodes as data volume demands, makes Hadoop clusters
well suited to big data analytics jobs whose data sets vary widely in size.
Hadoop Cluster Architecture
Hadoop clusters are composed of a network of master and worker nodes that orchestrate and
execute the various jobs across the Hadoop Distributed File System (HDFS). The master nodes
typically use higher-quality hardware and include a NameNode, a Secondary NameNode, and a
JobTracker, each running on a separate machine. The worker nodes are typically virtual
machines running both the DataNode and TaskTracker services on commodity hardware, and they
do the actual work of storing the data and processing the jobs as directed by the master
nodes. The final part of the system is the client nodes, which are responsible for loading
the data and fetching the results.
Master nodes are responsible for overseeing the storage of data in HDFS and for coordinating
key operations, such as running parallel computations on the data using MapReduce.
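To make this concrete, the following is a minimal sketch of a MapReduce job, the classic word count, written in Java against the Hadoop MapReduce API (Hadoop 2.x style). The class name and the use of command-line arguments for the HDFS input and output paths are illustrative assumptions rather than details of any particular cluster. Mappers run in parallel on the worker nodes, each consuming one block of input, and reducers aggregate the partial counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count job: mappers run in parallel on worker nodes, each
// processing one block of the input; reducers aggregate the partial counts.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // sum the partial counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is packaged as a JAR and submitted from a client node (for example with the hadoop jar command); the master then schedules the individual map and reduce tasks on the worker nodes.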
The worker nodes comprise most of the virtual machines in a Hadoop cluster and perform the
job of storing the data and running computations. Each worker node runs the DataNode and
TaskTracker services, which receive instructions from the master nodes.
Client nodes are in charge of loading the data into the cluster. They first submit MapReduce
jobs describing how the data should be processed, and then fetch the results once the
processing is finished.
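As an illustration of the client node's role, the sketch below uses the HDFS Java API (FileSystem) to copy a local log file into the cluster before a job runs and to read back the output files afterwards. The NameNode address and the paths are placeholders; in a real deployment they would come from the cluster's configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientNodeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // 1. Load data into the cluster: local log file -> HDFS input directory.
            fs.copyFromLocalFile(new Path("/var/log/app/access.log"),
                                 new Path("/user/analytics/input/access.log"));

            // 2. A MapReduce job (such as the word count above) would be submitted here.

            // 3. Fetch the results once processing is finished.
            for (FileStatus status : fs.listStatus(new Path("/user/analytics/output"))) {
                if (!status.isFile()) {
                    continue;               // skip any subdirectories in the output
                }
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        fs.open(status.getPath()), StandardCharsets.UTF_8))) {
                    reader.lines().forEach(System.out::println);
                }
            }
        }
    }
}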
Advantages of a Hadoop Cluster
o Hadoop clusters can boost the processing speed of many big data analytics jobs, given their ability to
break large computational tasks into smaller tasks that can be run in a parallel, distributed
fashion.
o Hadoop clusters are easily scalable and can quickly add nodes to increase throughput and maintain
processing speed as data volumes grow.
o The use of low-cost, high-availability commodity hardware makes Hadoop clusters relatively easy and
inexpensive to set up and maintain.
o Hadoop clusters replicate each data set across the distributed file system, making them resilient to data
loss and cluster failure.
o Hadoop clusters make it possible to integrate and leverage data from multiple different source systems
and data formats.
o It is also possible to deploy Hadoop as a single-node installation, for evaluation purposes.
Configuration modes:
Hadoop can be configured to run in one of three modes: standalone (local) mode, in which
everything runs in a single JVM against the local file system; pseudo-distributed mode, in
which all of the Hadoop daemons run on a single node; and fully distributed mode, in which
the daemons are spread across a multi-node cluster.
Apache Flume
Assume an e-commerce web application wants to analyze customer behavior from a particular
region. To do so, it needs to move the available log data into Hadoop for analysis. Here,
Apache Flume comes to the rescue.
Flume is used to move the log data generated by application servers into HDFS at high speed.
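As a rough sketch of how an application server hands its log data to Flume, the snippet below uses Flume's client SDK (RpcClient) to send a single log event to a Flume agent. It assumes the agent is listening with an Avro source on the given host and port, and that its channel and HDFS sink forward events into HDFS; the host name, port number, and log line are illustrative.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeLogSender {
    public static void main(String[] args) {
        // Assumed host/port of a Flume agent running an Avro source.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
            // Wrap one application log line in a Flume event and hand it to the agent;
            // the agent's channel and sink take it the rest of the way into HDFS.
            Event event = EventBuilder.withBody(
                    "GET /cart/checkout 200 region=south", StandardCharsets.UTF_8);
            client.append(event);
        } catch (EventDeliveryException e) {
            System.err.println("Event could not be delivered: " + e.getMessage());
        } finally {
            client.close();
        }
    }
}

In practice, applications often let a Flume source (for example a spooling-directory or exec source) pick up the log files directly, with the agent's channel buffering events on their way to the HDFS sink.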
Advantages of Flume
o Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
o When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between the data producers and the centralized stores and
provides a steady flow of data between them.
o Flume provides the feature of contextual routing.
o Transactions in Flume are channel-based: two transactions (one for the sender and one for the
receiver) are maintained for each message, which guarantees reliable message delivery.
o Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of Flume
o Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase)
efficiently.
o Using Flume, we can get data from multiple servers into Hadoop in near real time.
o Along with log files, Flume is also used to import the huge volumes of event data produced by
social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and
Flipkart.
o Flume supports a large set of source and destination types.
o Flume supports multi-hop flows, fan-in and fan-out flows, contextual routing, and more.
o Flume can be scaled horizontally.