HADOOP
A look at how Hadoop is finding a place in data
centers around the world, and changing the way
enterprises handle data analysis.
Hard as it might be to believe, 2016 is here and Hadoop is celebrating its tenth
birthday. That's right. It has been ten years since Apache Hadoop was first
introduced to the world, named for the toy elephant belonging to developer Doug
Cutting's son. Big Data promises possibilities, reveals trends, and demands careful
use, and organizations that ignore those possibilities are in danger of being left
behind. With the capability to analyze massive amounts of data, asking the right
question is key. In his extensive, nostalgic piece on Hadoop, Cutting draws
attention to the purpose and the mindset that led to Hadoop and, by extension, to
the Big Data revolution. Hadoop is not the only means to process Big Data, and
Spark has been getting a lot of attention recently for its outstanding performance
in real-time applications. However, Hadoop was the first to capture much attention
and remains central to most deployments. At the heart of it all, as Cutting reminds
us, Hadoop arose to meet a need, and has since stepped beyond its original bounds.
Cutting always intended Hadoop to be open source and used for broad applications.
Hadoop is the product and the parent of a new age of collaboration, a means to a
new way of working with data.
Where to Now?
In a recent survey on Big Data reported in ZDNet, businesses say they are
continuing to move away from the traditional approach to data processing. Hadoop
is cheaper than closed systems, which carry the high cost of updating and
maintaining outdated enterprise models. While the traditional mainframe and data
warehouse approach to large data stores and number-crunching made sense at one
time, when it was the only option, newer, scalable alternatives are here, and
everyone knows it. Why pay for large machines which you may or may not use, and
may or may not be able to schedule effectively, when dynamic virtual servers and
server clusters with automated schedulers do it all so much more quickly and
cheaply? That is the present-day reality, and while the conversion will continue
to take time, it is happening. Hadoop is the future, and it is here.
CHAPTER ONE
Hadoop and High
Performance Data Analysis
Big Data Meets HPC
A Gartner survey released in May 2015 shows that many respondents report no plans
to invest, some saying Hadoop was overkill, others citing skills gaps in their
employees.
Fortunately, much of the skills gap can be closed by deploying software tools that
make the tasks of building and operating a Hadoop cluster easier. But what about
performance?
According to IDC, many find the answer in High Performance Data Analysis (HPDA),
which combines the benefits of High Performance Computing (HPC) and Big Data
(including Hadoop) clusters. IDC reports that 67% of HPC sites use some form of
HPDA today, and predicts more than 23% growth in HPDA server deployments in the
years ahead.
Most of us were introduced to Hadoop as a tool to help companies extract useful
information from massive amounts of transactional data. The big online retailers
use it to analyze buying patterns and predict what people might want to buy next.
Online advertisers use it to figure out which ad to put in front of us on every
web page we
visit. And search vendors use it to help guess what you are really looking for
when you type a query. These uses of Hadoop are well known. It doesn't matter if
the data is structured or not; a Hadoop cluster can analyze just about any data
for valuable information.
In the world of mobile computing, Hadoop can analyze customer location, social
media activity, and other signals. The results are used to provide insight into
customer behavior by analyzing the way people use their mobile devices, exposing
activities that might undermine security.
Pharmaceutical companies use Hadoop to speed the path to market after developing
a new drug. Others in the healthcare industry also use Hadoop to provide more
personalized patient care. Hadoop has found its way into the financial world as
well, where some claim it is changing the world of computing.
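To make those use cases concrete, the canonical first Hadoop job is a word count:
it tallies how often each term appears across an arbitrarily large set of input
files, and the same map-then-reduce pattern underlies the transaction and
clickstream analyses described above. The sketch below uses the standard
org.apache.hadoop.mapreduce API; the class name and input/output paths are
illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in this mapper's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation per node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiled against the Hadoop client libraries and submitted with hadoop jar, the
identical job runs unchanged whether the cluster has four nodes or four thousand.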
Every day we hear about more organizations implementing Hadoop as part of their
data infrastructure, and in most cases it is installed on top of an existing,
working Linux cluster. But how did that cluster get there? And once everything is
up and running, how do they keep it that way in order to continue serving the data
scientists who depend on it?
With a small cluster, that's not such a big deal. As clusters get bigger, though,
the manual approach starts to break down.
Will those scripts you crafted so carefully scale as the cluster grows to
hundreds, or even thousands, of nodes? Will the next sysadmin who comes in after
you get promoted for your great work be able to do things just the way you
planned? Every time? Maybe not. Using an enterprise-grade cluster manager will
make every aspect of your interaction with your new Hadoop cluster more effective,
more efficient, and more consistent.
Easy deployment
It isn't too hard to build a cluster and install Hadoop on it in the lab, but
things get more complicated when you take your lab project into production. Others
may not be as intimate with the system as you are, and as they work to solve the
various issues that come up during a deployment, the sysadmin on site may make
changes that are never documented.
Even if that doesn't happen, there's still the matter of time. Manually installing
and configuring all the parameters necessary to build a functioning cluster and
then installing the Hadoop software on top of that can take a lot of time. The
bigger the cluster, the more time it will take. Possibly more time than the data
scientists are willing to wait. A good cluster manager takes care not only of the
Hadoop software, but also of the operating system software, networking software,
hardware parameters, disk formatting, and dozens of other little details, letting
you deploy and configure clusters of any size anywhere you need them.
Enterprise-grade cluster management software can still be a real benefit in
operating, monitoring, and managing the cluster once it is deployed. Such tools
will automatically detect and notify you of any drive failures, memory problems,
overheating, and other hardware issues.
When you were testing and evaluating your Hadoop cluster in the lab, you had full
control over it, and there wasn't a team of data scientists counting on it being
available whenever they need it. Now that the cluster is in production, you no
longer have the luxury of waiting for things to fail and bringing nodes back up
when you have time. You have to keep things humming 24x7x365, which requires
managing the Hadoop cluster the same way you would manage any other critical
service in the data center.
Cluster managers let you keep an eye on all critical metrics in your Hadoop cluster on one screen.
Data protection
While a properly configured Hadoop cluster does a great job of protecting data
through replication across multiple nodes, the system's performance can suffer
when that protection gets called into play. So it's a good idea to keep an eye on
the file system itself to make sure it is working properly. Is it spreading the
load evenly amongst available resources? Is the replication factor high enough?
Are any blocks under-replicated or corrupt? Cluster managers offer monitoring
capabilities and health checks that can be invaluable to anyone responsible for
maintaining a Hadoop cluster.
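Some of those checks can also be scripted directly against HDFS. The following is
a minimal sketch, using the standard org.apache.hadoop.fs.FileSystem API, that
walks a directory tree and flags files whose replication factor has fallen below a
threshold; the default path and the threshold of 3 are assumptions for
illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ReplicationCheck {
  // Illustrative threshold: HDFS ships with a default replication factor of 3.
  private static final short MIN_REPLICATION = 3;

  public static void main(String[] args) throws Exception {
    // Reads core-site.xml / hdfs-site.xml from the classpath to find the NameNode.
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      // Recursively list every file under the given path (default: the root).
      Path root = new Path(args.length > 0 ? args[0] : "/");
      RemoteIterator<LocatedFileStatus> files = fs.listFiles(root, true);
      while (files.hasNext()) {
        LocatedFileStatus status = files.next();
        if (status.getReplication() < MIN_REPLICATION) {
          System.out.printf("under-replicated: %s (factor %d)%n",
              status.getPath(), (int) status.getReplication());
        }
      }
    }
  }
}

The hdfs fsck command exposes similar information from the command line, including
corrupt and missing blocks, and is a common target for scheduled health checks.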
Managing Change
Change is inevitable, and your cluster will need to be modified after it has been
put into production. For example, you may need to add, replace, or remove nodes.
You may need to upgrade or replace operating systems or other critical software
across a large number of nodes. Cluster managers make light work of such heavy
tasks. They provide administrators with a user-friendly interface to do things
like set parameters, select images, and provide other necessary information that
allows the cluster manager to roll the change out consistently across all
affected nodes.
Apache Hadoop has been around for longer than many people realize. It's now over
a decade old, and has come a very long way since 2002. GigaOm's Derrick Harris
did a good job covering Hadoop's history in a series of posts titled "The history
of Hadoop."
There is now a thriving community built up to support and extend Hadoop, and it
has spawned a bevy of startups seeking to address various aspects of the Hadoop
ecosystem for profit. But despite all the progress, there's still an elephant in
the room (sorry, I couldn't resist) that nobody talks much about: cluster
management.
What's in a Name?
Ambari, the management component of Apache Hadoop, and other management tools in
the ecosystem focus on Hadoop itself, not on the provisioning, installation, and
management of the hardware and software Hadoop needs to have in place underneath
it.
Most discussions about Hadoop deployment start in the middle of the story. They
assume there is already a working cluster to install Hadoop on. They also ignore
the ongoing monitoring and management of the underlying infrastructure that the
Hadoop software runs on.
So how do you get from bare hardware to the point where you have a functioning
cluster of servers, usually running Linux, that is fully networked and loaded
with the software Hadoop requires?
You can build the systems by hand, installing the right combination of operating
system and application software, and configuring the networking hardware and
software properly. You might even use sophisticated scripting tools to help speed up
that process. But there is a better way: cluster managers exist to fill that gap.
You can run a cluster without one, but using one brings a number of advantages.
First, they can save you a lot of time up front, when you first rack the servers
and load them with software. Cluster managers automate the entire process once
the servers are racked and cabled. They do hardware discovery, read the intended
configuration of each machine from a database, and load the software onto them
all, whether you have 4 nodes or 4,000.
They do this repeatedly and consistently, which is handy when you need to deploy
additional nodes down the road. Automating the process reduces the number of
configuration errors and missteps that can occur
whenever a complex, multi-step procedure is involved. A good cluster manager will
let you go from bare metal to a working cluster in less than an hour.
The usefulness of cluster managers doesn't end once the Hadoop cluster is
deployed. While the Hadoop management software available in the leading
distributions does a great job of monitoring and managing Hadoop, it provides
little information about the underlying cluster that the Hadoop system runs on.
Cluster managers provide the missing pieces by monitoring and managing the
infrastructure to keep things running properly. They monitor things like CPU
temperatures and disk performance, and can alert operators to problems before
they snowball out of control.
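At its core, this kind of infrastructure monitoring is threshold alerting. The
sketch below is a deliberately simplified illustration, not any particular
vendor's API: it polls free space on a node's data disk and raises an alert
before the disk fills up. The mount point, threshold, and polling interval are
all assumptions for the example.

import java.io.File;

public class DiskSpaceMonitor {
  // Illustrative assumptions: a /data mount point, a 10% free-space floor,
  // and a one-minute polling interval.
  private static final String MOUNT_POINT = "/data";
  private static final double MIN_FREE_RATIO = 0.10;
  private static final long POLL_INTERVAL_MS = 60_000;

  public static void main(String[] args) throws InterruptedException {
    File disk = new File(MOUNT_POINT);
    while (true) {
      long total = disk.getTotalSpace();
      if (total == 0) {
        // Path does not exist or cannot be read: that is itself an alert.
        System.err.println("ALERT: cannot read filesystem stats for " + MOUNT_POINT);
      } else {
        double freeRatio = (double) disk.getUsableSpace() / total;
        if (freeRatio < MIN_FREE_RATIO) {
          // A real cluster manager would page an operator or open a ticket;
          // this sketch just writes to stderr.
          System.err.printf("ALERT: %s is only %.1f%% free (threshold %.0f%%)%n",
              MOUNT_POINT, freeRatio * 100, MIN_FREE_RATIO * 100);
        }
      }
      Thread.sleep(POLL_INTERVAL_MS);
    }
  }
}

A production cluster manager would feed the same signal into dashboards, paging
systems, and automated remediation rather than a log line, but the underlying
check is no more complicated than this.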
Plan Ahead
So the next time you're discussing Hadoop solutions for your organization, don't
forget to consider the foundation those solutions rest on. Once you put your
cluster into production, you'll want to be able to manage the entire stack.
Consider the operational procedures that need to happen before you install
Hadoop, and the systems you'll need to have in place to keep your cluster healthy
AFTER Hadoop is up and running. I promise that down the road you will be glad you
took the time to fill the gaps.
SIGN UP FOR A FREE DEMO