BigData Hadoop Online Training by Experts
Contact Us:
India: +91 8121660044
USA: +1 732-419-2619
Site: http://www.hadooponlinetutor.com
Introduction
Big Data:
Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a company creates: data that would take too much time and cost too much money to load into a relational database for analysis.
Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
The New York Stock Exchange generates about one terabyte of new trade data per day.
The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20
terabytes per month.
The Large Hadron Collider near Geneva, Switzerland, produces about 15 petabytes of
data per year.
A typical consumer hard drive, 1990 vs. 2010 (capacity has grown almost a thousandfold, while transfer speed has grown only about twentyfold):

Year   Drive capacity         Transfer speed
1990   1,370 MB               4.4 MB/s
2010   ~1,000,000 MB (1 TB)   100 MB/s
So What Do We Do?
The obvious solution is to use multiple processors to solve the same problem by fragmenting it into pieces.
Imagine if we had 100 drives, each holding one hundredth of the data. Reading a full 1 TB drive at 100 MB/s takes about 10,000 seconds, nearly three hours; working in parallel across 100 drives, we could read the same data in under two minutes.
Distributed Computing vs. Parallelization
Parallelization: multiple processors within a single machine work on the same problem in parallel, typically sharing memory.
Examples
The Cray-2, a four-processor ECL vector supercomputer made by Cray Research starting in 1985, is a classic example of this approach.
Distributed Computing
The key issues involved in this solution:
Hardware failure
Combining the data after analysis
Network-associated problems
Deep Blue (chess)
Multiplying large matrices
Simulating several hundreds of characters (as in The Lord of the Rings battle scenes)
Indexing the Web (Google)
Simulating an Internet-sized network for network experiments
Hadoop To The Rescue!
The Hadoop project and its subprojects include:
Core (the underlying filesystem and I/O components)
Avro (data serialization)
Pig (a dataflow language for analyzing large datasets)
HBase (a distributed, column-oriented database)
Zookeeper (a distributed coordination service)
Hive (data warehousing with SQL-like queries)
Chukwa (data collection and monitoring)
The theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines.
Hadoop ties these smaller and more reasonably priced machines together into a single cost-effective compute cluster.
MapReduce
By restricting the communication between nodes, Hadoop makes the distributed system
much more reliable. Individual node failures can be worked around by restarting tasks
on other machines.
The other workers continue to operate as though nothing went wrong, leaving the
challenging aspects of partially restarting the program to the underlying Hadoop layer.
What is MapReduce?
Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate values
associated with the same intermediate key I and passes them to the Reduce
function.
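As a concrete sketch (the word-count use case and class name are our own illustration, not from the slides), a Map function in Hadoop's standard Java API might look like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Takes an input pair (byte offset, line of text) and produces
// intermediate key/value pairs of the form (word, 1).
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE); // intermediate pair: (word, 1)
        }
    }
}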
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values.
This abstraction allows us to handle lists of values that are too large to fit in memory.
Example:
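A matching Reduce function, continuing the word-count sketch above (the values arrive via an Iterable, so a list too large for memory can be streamed):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Accepts an intermediate key (a word) and the set of values for that
// key (its 1s), and merges them into a smaller set (a single count).
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // final pair: (word, total count)
    }
}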
Orientation of Nodes
Data Locality Optimization:
The compute nodes and the storage nodes are the same. The Map-Reduce
framework and the Distributed File System run on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where
data is already present, resulting in very high aggregate bandwidth across the
cluster.
If this is not possible: The computation is done by another processor on the same
rack.
Moving Computation is Cheaper than Moving Data
A Map-Reduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Typically both the input and the output of the job are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the failed
tasks.
A MapReduce job is a unit of work that the client wants to be performed: it consists of
the input data, the MapReduce program, and configuration information. Hadoop runs
the job by dividing it into tasks, of which there are two types: map tasks and reduce
tasks
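A minimal driver sketch showing those three ingredients wired together (TokenizerMapper and IntSumReducer are the illustrative classes from the earlier sketches; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // configuration information
        Job job = Job.getInstance(conf, "word count"); // the unit of work
        job.setJarByClass(WordCount.class);            // the MapReduce program
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the input data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // where output goes
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // run and monitor
    }
}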
Fault Tolerance
There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on
tasktrackers.
Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record
of the overall progress of each job.
Input Splits
Input splits: Hadoop divides the input to a MapReduce job into fixed-size
pieces called input splits, or just splits. Hadoop creates one map task for each
split, which runs the user-defined map function for each record in the split.
The quality of the load balancing increases as the splits become more fine-grained.
BUT if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default.
WHY? A block is the largest amount of input that is guaranteed to be stored on a single node; if a split spanned two blocks, part of the data would have to be transferred across the network to the node running the map task.
Map tasks write their output to local disk, not to HDFS. Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be a waste. It is also possible that the node running the map task fails before the map output has been consumed by the reduce task.
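If a job does need a different split size, the bounds can be nudged through FileInputFormat helpers; a sketch (the 128 MB figure is an arbitrary illustration):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
    // The framework computes splitSize = max(minSize, min(maxSize, blockSize)),
    // which by default yields one split per HDFS block.
    public static void widenSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // 128 MB floor
    }
}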
MapReduce data flow with a single reduce task
MapReduce data flow with multiple reduce tasks
MapReduce data flow with no reduce tasks
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster.
To minimize the data transferred between the map and reduce tasks, combiner functions are introduced.
Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function.
Combiner functions can help cut down the amount of data shuffled between the maps and the reduces, as the sketch below shows.
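Enabling a combiner is a one-line change in the driver. In the word-count sketch, the reduce function (summing counts) is associative and commutative, so it can double as the combiner:

import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // Run the reduce logic on each map task's local output before the
    // shuffle, so only partial sums cross the network. IntSumReducer is
    // the illustrative reducer sketched earlier.
    public static void enableCombiner(Job job) {
        job.setCombinerClass(IntSumReducer.class);
    }
}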
Hadoop Streaming:
Hadoop provides an API to MapReduce that allows you to
write your map and reduce functions in languages other than
Java.
Hadoop Streaming uses Unix standard streams as the
interface between Hadoop and your program, so you can use
any language that can read standard input and write to
standard output to write your MapReduce program.
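To keep these sketches in one language, here is the whole contract a Streaming mapper must satisfy, shown as a stand-alone Java program (in practice you would use Python, Ruby, or any other language; Streaming only sees lines on standard input and tab-separated key/value lines on standard output):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

// A word-count map step against the Streaming contract:
// read records from stdin, emit "key<TAB>value" lines on stdout.
public class StreamingMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                System.out.println(tok.nextToken() + "\t1"); // (word, 1)
            }
        }
    }
}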
Hadoop Pipes:
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
Unlike Streaming, which uses standard input and output to communicate
with the map and reduce code, Pipes uses sockets as the channel over
which the tasktracker communicates with the process running the C++ map
or reduce function. JNI is not used.
HADOOP DISTRIBUTED
FILESYSTEM (HDFS)
Filesystems that manage the storage across a network of machines are called
distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for
Hadoop Distributed Filesystem.
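As a taste of the client-side API (the file path is a hypothetical example), reading a file from HDFS looks much like ordinary Java I/O:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up fs.defaultFS etc.
        FileSystem fs = FileSystem.get(conf);         // handle to the cluster filesystem
        Path file = new Path("/user/demo/input.txt"); // hypothetical HDFS path
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // stream the file's contents to stdout
            }
        }
    }
}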
Goals of HDFS
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets.
They are not general purpose applications that typically run on general
purpose file systems. HDFS is designed more for batch processing rather
than interactive use by users. The emphasis is on high throughput of data
access rather than low latency of data access. POSIX imposes many hard
requirements that are not needed for applications that are targeted for
HDFS. POSIX semantics in a few key areas has been traded to increase
data throughput rates.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A
file once created, written, and closed need not be changed. This
assumption simplifies data coherency issues and enables high throughput
data access. A Map/Reduce application or a web crawler application fits
perfectly with this model. There is a plan to support appending-writes to files in the future.
Design of HDFS
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure. It is also worth examining the applications for which HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today: low-latency data access, large numbers of small files, and workloads with multiple writers or arbitrary file modifications.
Contact Us:
Our Address:
#444, 4th floor, Gumidelli Commercial Complex
Reliance Trends Building
Begumpet, Hyderabad
Phone:
USA : +1 732-419-2619
INDIA: +91 8121660044
Email:
[email protected]
Website: http://www.hadooponlinetutor.com