Ebook Hadoop

This document discusses how Hadoop is becoming widely adopted for large scale data analysis. It celebrates Hadoop's 10th anniversary and discusses how its use has expanded beyond just large internet companies. The document also discusses how Hadoop is now being used alongside high performance computing for high performance data analysis. Finally, it discusses why large Hadoop clusters require specialized cluster management software to deploy, monitor, and manage the clusters effectively at large scales.

Uploaded by

sundarms
Copyright
© All Rights Reserved
A Brief Update on

HADOOP
A look at how Hadoop is finding a place in data
centers around the world, and changing the way
enterprises handle data analysis.

A Publication of Bright Computing


Table of Contents

Happy Birthday Hadoop

Hadoop Meets HPC

Hadoop's Big Impact on Large Scale Computing

Why Your Hadoop Cluster Needs a Cluster Manager

Plugging the Hole in Hadoop Cluster Management


INTRODUCTION
Happy Birthday Hadoop!
Here's to Another Digital Decade

Hard as it might be to believe, 2016 is here and Hadoop is celebrating its tenth birthday. That's right. It has been ten years since Apache Hadoop was first introduced to the world, named for the toy elephant belonging to developer Doug Cutting's then-infant son.

The Big Data Revolution, and Hadoop's Role

Big Data was the talk of the town for much of 2015, with businesses that ignore the possibilities in danger of being left behind. Big Data promises possibilities, reveals trends, and demands careful use. With the capability to analyze massive amounts of data, asking the right question is key. In his extensive, nostalgic piece on Hadoop, Cutting draws attention to the purpose and the mindset leading to Hadoop and by extension to the Big Data revolution. Hadoop is not the only means to process Big Data, and Spark has been getting a lot of attention recently for its outstanding performance in real-time applications. However, Hadoop was the first to capture much attention and remains central to most deployments. At the heart of it all, as Cutting reminds us, Hadoop arose to meet a need, and has stepped beyond its bonds. Cutting always intended Hadoop to be open-source and used for broad applications. Hadoop is the product and the parent of a new age of collaboration, a means to a more efficient end, a new way of doing software.

Where to Now?
According to a recent survey on Big Data reported in ZDNet, businesses are continuing to embrace Hadoop in larger numbers, attracted to its open-source, collaborative approach to data processing. Hadoop is cheaper than closed systems, which carry the high cost of updating and maintaining outdated enterprise models. While the traditional mainframe and data warehouse approach to large data stores and number-crunching made sense at one time, when it was the only option, newer, scalable alternatives are here, and everyone knows it. Why pay for large machines which you may or may not use, and may or may not be able to schedule effectively, when dynamic virtual servers and server clusters with automated schedulers do it all so much more quickly and cheaply? This is the present-day reality, and while the conversion will continue to take time, it is happening. Hadoop is the future and it is here.
CHAPTER ONE
Hadoop and High
Performance Data Analysis
Big Data Meets HPC

Hadoop adoption continues to grow; however, some challenges remain. A Gartner survey released in May 2015 shows that many respondents report no plans to invest, some saying Hadoop was overkill, others citing skills gaps among their employees.

Fortunately, much of the skills gap can be closed by deploying software tools that make the tasks of building and operating a Hadoop cluster easier. But what about the business value of Hadoop? Is it worth the investment?

According to IDC, many find the answer in High Performance Data Analysis (HPDA), which combines the benefits of High Performance Computing (HPC) and Big Data (including Hadoop) clusters. IDC reports that 67% of HPC sites use some form of HPDA today, and it predicts more than 23% growth in HPDA server deployments in the next two years.

Expect to see increasing numbers of businesses deploying Hadoop alongside HPC to solve their business problems in 2016 and beyond.


CHAPTER TWO
Hadoop's Big Impact
on Large Scale Computing

Most of us were introduced to Hadoop as a tool to help companies extract useful information from massive amounts of transactional data. The big online retailers use it to analyze buying patterns and predict what people might want to buy next. Online advertisers use it to figure out which ad to put in front of us on every web page we visit. And search vendors use it to help guess what you are really looking for when you type that ad hoc query into your search bar.

The use of Hadoop is not limited to large search engines or even Internet marketing... businesses across many industries are leveraging Hadoop to reach multiple business goals.
Hadoop's ability to absorb any type of data is well known. It doesn't matter if the data is structured or not, a Hadoop cluster can analyze just about any data for valuable information. Data from a variety of sources can be combined in new ways that provide results that other systems can't.

In the world of mobile computing, Hadoop can analyze customer location, social media activity, and other signals. The results are used to provide those customers with services specifically tailored to their needs.

Hadoop has found its way into the financial world in a number of ways. It's used to detect fraud, and identify security threats that might otherwise go unnoticed. It also provides valuable insight into customer behavior by analyzing the way they use their mobile devices, exposing activities that might undermine security.

Hadoop helps pharmaceutical companies quickly root out low performers when bringing new drugs to market. It can also shorten the time to market after developing a new drug. Others in the healthcare industry also use Hadoop to provide more personalized patient care.

While Hadoop isn't the solution to all the world's problems, as the hype machine sometimes claims, it is changing the world of computing in very real ways.


CHAPTER THREE
Cluster Managers

Why Your Hadoop Cluster Needs One

Every day we hear about more organizations implementing Hadoop as part of their IT infrastructure. Most people think of Hadoop installation as something that starts on top of an existing, working Linux cluster. But how did that cluster get there? And once everything is up and running, how do they keep it that way in order to continue getting value out of it?

With a small cluster, that's not such a big deal. As clusters get bigger, though, the work involved in keeping them healthy grows, a lot.

Will those scripts you crafted so carefully scale as the cluster grows to hundreds, or even thousands of nodes? Will the next sysadmin that comes in after you get promoted for your great work be able to do things just the way you planned? Every time? Maybe not. Using an enterprise-grade cluster manager will make every aspect of your interaction with your new Hadoop cluster more effective, more efficient and more consistent.
Easy deployment
It isn't too hard to build a cluster and install Hadoop on it in the lab, but things get more complicated when you take your lab project into production. Others may not be as intimate with the system as you are, and as they work to solve the various issues that come up during a deployment, the sysadmin on site may make changes that cause problems.

Even if that doesn't happen, there's still the matter of time. Manually installing and configuring all the parameters necessary to build a functioning cluster and then installing the Hadoop software on top of that can take a lot of time. The bigger the cluster, the more time it will take. Possibly more time than the data scientists waiting for their new cluster want to wait.

Cluster management software automates the installation and configuration process, not only of the Hadoop software, but also of the operating system software, networking software, hardware parameters, disk formatting, and dozens of other little things that need to be done.


Keeping an eye on things
You may have a top-notch installation and deployment team that can quickly build and configure clusters of any size anywhere you need them. Enterprise-grade cluster management software can still be a real benefit in operating, monitoring, and managing the cluster.

Cluster managers will automatically detect and alert you to any drive failures, memory problems, overheating, and configuration errors.

When you were testing and evaluating your Hadoop cluster in the lab, you had full control over it, and there wasn't a team of data scientists counting on it being available whenever they need it. Now that the cluster is in production, you no longer have the luxury of waiting for things to fail, and bringing nodes back up when you have time. You have to keep things humming 24x7x365, which requires managing the Hadoop cluster the same way you would manage any other critical service in the data center. You need professional grade management tools.

Cluster managers let you keep an eye on all critical metrics in your Hadoop cluster on one screen.
Data protection
While a properly configured Hadoop cluster does a great job of protecting data through replication across multiple nodes, the system's performance can suffer when that protection gets called into play. So it's a good idea to keep an eye on the file system itself to make sure it is working properly. Is it spreading the load evenly amongst available resources? Is the replication factor high enough? Are some of the drives reaching capacity? An enterprise-grade cluster manager can answer these questions for you.
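
To make those three questions concrete, here is a minimal Python sketch of the kind of storage health check a cluster manager might run. It is an illustration only, not how any particular product works: the `DataNode` record, the `check_storage` function, and every threshold below are assumptions made for this example. The one Hadoop-specific number is the minimum replication factor of 3, which matches the usual HDFS default.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    """Hypothetical per-node storage record; not a real Hadoop API."""
    name: str
    capacity_gb: float
    used_gb: float


def check_storage(nodes, replication_factor, min_replication=3,
                  full_threshold=0.85, imbalance_threshold=0.10):
    """Return a list of human-readable warnings (empty if all is well)."""
    if not nodes:
        return ["no data nodes reported"]
    warnings = []

    # Is the replication factor high enough?
    if replication_factor < min_replication:
        warnings.append(
            f"replication factor {replication_factor} is below {min_replication}")

    # Fraction of each node's capacity that is in use.
    usage = {n.name: n.used_gb / n.capacity_gb for n in nodes}
    mean_usage = sum(usage.values()) / len(usage)

    for name in sorted(usage):
        # Are some of the drives reaching capacity?
        if usage[name] >= full_threshold:
            warnings.append(f"{name} is {usage[name]:.0%} full")
        # Is the load spread evenly across the available nodes?
        if abs(usage[name] - mean_usage) > imbalance_threshold:
            warnings.append(
                f"{name} usage {usage[name]:.0%} is far from "
                f"the cluster mean {mean_usage:.0%}")
    return warnings


# Example with made-up numbers for a three-node cluster.
nodes = [
    DataNode("node01", 1000, 900),
    DataNode("node02", 1000, 620),
    DataNode("node03", 1000, 580),
]
for warning in check_storage(nodes, replication_factor=2):
    print(warning)
```

In practice the numbers would come from the file system's own reporting rather than a hand-built list, and the thresholds would be tuned to the cluster in question.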

Modern enterprise cluster managers provide dozens of other monitoring capabilities and health checks that can be invaluable to anyone responsible for maintaining a Hadoop cluster.

Managing Change
Change is inevitable, and your cluster will need to be modified after it has been put into production. For example, you may need to add, replace, or remove nodes. You may need to upgrade or replace operating systems, or other critical software across a large number of nodes. Cluster managers make light work of such heavy tasks. They provide administrators with a user-friendly interface to do things like set parameters, select images, and provide other necessary information that allows the cluster manager to make the necessary changes automatically.


CHAPTER FOUR
Plugging the Hole
in Hadoop Cluster Management

Apache Hadoop has been around for longer than many people realize. It's now over a decade old, and has come a very long way since 2002. GigaOm's Derrick Harris did a good job covering Hadoop's history in a series of posts titled "The history of Hadoop: From 4 nodes to the future of data."

There is now a thriving community built up to support and extend Hadoop, and it has spawned a bevy of startups seeking to address various aspects of the Hadoop ecosystem for profit. But despite all the progress, there's still an elephant in the room (sorry, I couldn't resist) that nobody talks much about: cluster management.

What's in a Name?
Ambari, the management component of Apache Hadoop, and other management solutions focus on Hadoop-related software. They don't address the configuration, installation, and management of the hardware and software Hadoop needs to have in place before you can even install it.

Most discussions about Hadoop deployment start in the middle of the story. They assume there is already a working cluster to install Hadoop on. They also ignore the ongoing monitoring and management of the underlying infrastructure that the Hadoop cluster is built upon.

So how do you get from bare hardware to the point where you have a functioning cluster of servers, usually running Linux, that is fully networked and loaded with all the right software?

You can build the systems by hand, installing the right combination of operating system and application software, and configuring the networking hardware and software properly. You might even use sophisticated scripting tools to help speed up that process. But there is a better way. Cluster managers exist to fill that gap.

Who needs a cluster manager anyway?


It's true you can build and manage your Hadoop systems without a cluster manager, but using one brings a number of advantages. First, they can save you a lot of time up front, when you first rack the servers and load them with software. Cluster managers automate the entire process once the servers are racked and cabled. They do hardware discovery, read the intended configuration of each machine from a database, and load the software onto them all, whether you have 4 nodes or 4000. They do this repeatedly and consistently, which is handy when you need to deploy a series of clusters in branches of your organization around the world. Automating the process reduces the number of configuration errors and missteps that can occur whenever a complex, multi-step procedure is involved. A good cluster manager will let you go from bare metal to a working cluster in less than an hour.
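
The discovery-and-match step described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the workings of any real cluster manager: the `plan_provisioning` function, the MAC-address keys, and the hostnames and image names are all hypothetical, and a plain dictionary stands in for the configuration database.

```python
def plan_provisioning(discovered_macs, intended):
    """Match discovered hardware against the intended configuration.

    discovered_macs: MAC addresses seen during hardware discovery.
    intended: dict mapping MAC address -> {"hostname": ..., "image": ...},
              standing in for the configuration database.
    Returns (plan, unknown, missing):
      plan    - hostname -> software image to load
      unknown - discovered MACs with no database entry
      missing - configured hostnames that were never discovered
    """
    plan = {}
    unknown = []
    for mac in discovered_macs:
        node = intended.get(mac)
        if node is None:
            unknown.append(mac)  # hardware nobody told us about
        else:
            plan[node["hostname"]] = node["image"]

    expected = {node["hostname"] for node in intended.values()}
    missing = sorted(expected - set(plan))
    return plan, unknown, missing


# Hypothetical database contents and discovery results.
intended = {
    "aa:bb:cc:00:00:01": {"hostname": "node001", "image": "hadoop-worker"},
    "aa:bb:cc:00:00:02": {"hostname": "node002", "image": "hadoop-worker"},
    "aa:bb:cc:00:00:03": {"hostname": "head01", "image": "hadoop-master"},
}
discovered = ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:03", "aa:bb:cc:00:00:ff"]

plan, unknown, missing = plan_provisioning(discovered, intended)
print("to provision:", plan)
print("unknown hardware:", unknown)
print("not yet discovered:", missing)
```

Because the same function runs the same way for 4 entries or 4000, the result is repeatable, which is the property the paragraph above is describing. Reporting unknown and missing machines separately is one way to surface cabling or inventory mistakes before any software is loaded.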

The usefulness of cluster managers doesn't end once the Hadoop cluster is deployed. While the Hadoop management software available in the leading distributions does a great job of monitoring and managing Hadoop, it provides little information about the underlying cluster that the Hadoop system runs on. Cluster managers provide the missing pieces by monitoring and managing the infrastructure to keep things running properly. They monitor things like CPU temperatures and disk performance, and can alert operators to problems before they snowball out of control.
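
As a rough sketch of how that kind of early warning could work, the Python below raises an alert only when a metric stays above its limit for several consecutive samples, so a single spike is ignored but a sustained problem is caught. Everything here is assumed for illustration: the function names, the metric names `cpu_temp_c` and `disk_latency_ms`, the limits, and the three-sample window are not taken from any real monitoring product.

```python
def sustained_breach(samples, limit, window=3):
    """True if the most recent `window` samples all exceed `limit`."""
    if len(samples) < window:
        return False
    return all(value > limit for value in samples[-window:])


def find_alerts(metrics, limits, window=3):
    """Return (node, metric, latest_value) for every sustained breach.

    metrics: node name -> {metric name -> list of samples, oldest first}
    limits:  metric name -> upper limit (metrics without a limit are skipped)
    """
    alerts = []
    for node in sorted(metrics):
        for metric in sorted(metrics[node]):
            samples = metrics[node][metric]
            limit = limits.get(metric)
            if limit is not None and sustained_breach(samples, limit, window):
                alerts.append((node, metric, samples[-1]))
    return alerts


# Made-up readings: node001 is running hot; node002 had one brief spike.
metrics = {
    "node001": {"cpu_temp_c": [70, 86, 88, 91], "disk_latency_ms": [5, 6, 40]},
    "node002": {"cpu_temp_c": [60, 90, 62, 61]},
}
limits = {"cpu_temp_c": 85, "disk_latency_ms": 20}

for node, metric, value in find_alerts(metrics, limits):
    print(f"ALERT {node}: {metric} has stayed above its limit (latest {value})")
```

Requiring a run of bad samples is only one possible design choice; it trades a little detection delay for fewer false alarms.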

Plan Ahead
So the next time you're discussing Hadoop solutions for your organization, don't forget to consider the foundation those solutions rest on. Once you put your cluster into production, you'll want to be able to manage the entire stack. Consider the operational procedures that need to happen before you install Hadoop, and the systems you'll need to have in place to keep your cluster healthy AFTER Hadoop is up and running. I promise you will be glad you took the time to fill the gaps down the road.
SIGN UP FOR A FREE DEMO

Talk to one of our experts and see how Hadoop can help you get the results you really want.


EBHADOOP03032016.1
