BigData Hadoop Online Training by Experts
Contact Us:
India: +91 8121660044
USA: +1 732-419-2619
Site: http://www.hadooponlinetutor.com
Introduction
Big Data:
Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a company creates: data that would take too much time and cost too much money to load into a relational database for analysis.
Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
The New York Stock Exchange generates about one terabyte of new trade data per day.
The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20
terabytes per month.
The Large Hadron Collider near Geneva, Switzerland, produces about 15 petabytes of
data per year.
A typical consumer hard drive, 1990 vs. 2010 (capacity has grown almost a thousandfold, while transfer speed has grown only about twentyfold):

Year   Drive capacity         Transfer speed
1990   1,370 MB               4.4 MB/s
2010   ~1,000,000 MB (1 TB)   100 MB/s
So What Do We Do?
The obvious solution is to use multiple processors to solve the same problem by fragmenting it into pieces.
Imagine if we had 100 drives, each holding one hundredth of the data. Reading a full 1 TB drive at 100 MB/s takes about 10,000 seconds, nearly three hours; working in parallel across 100 drives, we could read the same data in under two minutes.
Distributed Computing vs. Parallelization
Parallelization: multiple processors within a single machine work on the same problem in parallel, typically sharing memory.
Examples
The Cray-2, a four-processor ECL vector supercomputer made by Cray Research starting in 1985, is a classic example of this approach.
Distributed Computing
The key issues involved in this solution:
Hardware failure
Combining the data after analysis
Network-associated problems
Deep Blue (chess)
Multiplying large matrices
Simulating several hundreds of characters (as in The Lord of the Rings battle scenes)
Indexing the Web (Google)
Simulating an Internet-sized network for network experiments
Hadoop To The Rescue!
The Hadoop project and its subprojects include:
Core (the underlying filesystem and I/O components)
Avro (data serialization)
Pig (a dataflow language for analyzing large datasets)
HBase (a distributed, column-oriented database)
Zookeeper (a distributed coordination service)
Hive (data warehousing with SQL-like queries)
Chukwa (data collection and monitoring)
The theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines.
Hadoop ties these smaller and more reasonably priced machines together into a single cost-effective compute cluster.
MapReduce
By restricting the communication between nodes, Hadoop makes the distributed system
much more reliable. Individual node failures can be worked around by restarting tasks
on other machines.
The other workers continue to operate as though nothing went wrong, leaving the
challenging aspects of partially restarting the program to the underlying Hadoop layer.
What is MapReduce?
Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate values
associated with the same intermediate key I and passes them to the Reduce
function.
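As a concrete sketch (the word-count use case and class name are our own illustration, not from the slides), a Map function in Hadoop's standard Java API might look like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Takes an input pair (byte offset, line of text) and produces
// intermediate key/value pairs of the form (word, 1).
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE); // intermediate pair: (word, 1)
        }
    }
}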
The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values.
This abstraction allows us to handle lists of values that are too large to fit in memory.
Example:
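A matching Reduce function, continuing the word-count sketch above (the values arrive via an Iterable, so a list too large for memory can be streamed):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Accepts an intermediate key (a word) and the set of values for that
// key (its 1s), and merges them into a smaller set (a single count).
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // final pair: (word, total count)
    }
}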
Orientation of Nodes
Data Locality Optimization:
The compute nodes and the storage nodes are the same. The Map-Reduce
framework and the Distributed File System run on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where
data is already present, resulting in very high aggregate bandwidth across the
cluster.
If this is not possible: The computation is done by another processor on the same
rack.
Moving Computation is Cheaper than Moving Data
A Map-Reduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Typically both the input and the output of the job are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the failed
tasks.
A MapReduce job is a unit of work that the client wants to be performed: it consists of
the input data, the MapReduce program, and configuration information. Hadoop runs
the job by dividing it into tasks, of which there are two types: map tasks and reduce
tasks
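A minimal driver sketch showing those three ingredients wired together (TokenizerMapper and IntSumReducer are the illustrative classes from the earlier sketches; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // configuration information
        Job job = Job.getInstance(conf, "word count"); // the unit of work
        job.setJarByClass(WordCount.class);            // the MapReduce program
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the input data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // where output goes
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // run and monitor
    }
}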
Fault Tolerance
There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on
tasktrackers.
Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record
of the overall progress of each job.
Input Splits
Input splits: Hadoop divides the input to a MapReduce job into fixed-size
pieces called input splits, or just splits. Hadoop creates one map task for each
split, which runs the user-defined map function for each record in the split.
The quality of the load balancing increases as the splits become more fine-grained.
BUT if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default.
WHY? A block is the largest amount of input that is guaranteed to be stored on a single node; if a split spanned two blocks, part of the data would have to be transferred across the network to the node running the map task.
Map tasks write their output to local disk, not to HDFS. Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be a waste. It is also possible that the node running the map task fails before the map output has been consumed by the reduce task.
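If a job does need a different split size, the bounds can be nudged through FileInputFormat helpers; a sketch (the 128 MB figure is an arbitrary illustration):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
    // The framework computes splitSize = max(minSize, min(maxSize, blockSize)),
    // which by default yields one split per HDFS block.
    public static void widenSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // 128 MB floor
    }
}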
MapReduce data flow with a single reduce task
MapReduce data flow with multiple reduce tasks
MapReduce data flow with no reduce tasks
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster.
To minimize the data transferred between the map and reduce tasks, combiner functions are introduced.
Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function.
Combiner functions can help cut down the amount of data shuffled between the maps and the reduces, as the sketch below shows.
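Enabling a combiner is a one-line change in the driver. In the word-count sketch, the reduce function (summing counts) is associative and commutative, so it can double as the combiner:

import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // Run the reduce logic on each map task's local output before the
    // shuffle, so only partial sums cross the network. IntSumReducer is
    // the illustrative reducer sketched earlier.
    public static void enableCombiner(Job job) {
        job.setCombinerClass(IntSumReducer.class);
    }
}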
Hadoop Streaming:
Hadoop provides an API to MapReduce that allows you to
write your map and reduce functions in languages other than
Java.
Hadoop Streaming uses Unix standard streams as the
interface between Hadoop and your program, so you can use
any language that can read standard input and write to
standard output to write your MapReduce program.
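To keep these sketches in one language, here is the whole contract a Streaming mapper must satisfy, shown as a stand-alone Java program (in practice you would use Python, Ruby, or any other language; Streaming only sees lines on standard input and tab-separated key/value lines on standard output):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

// A word-count map step against the Streaming contract:
// read records from stdin, emit "key<TAB>value" lines on stdout.
public class StreamingMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                System.out.println(tok.nextToken() + "\t1"); // (word, 1)
            }
        }
    }
}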
Hadoop Pipes:
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
Unlike Streaming, which uses standard input and output to communicate
with the map and reduce code, Pipes uses sockets as the channel over
which the tasktracker communicates with the process running the C++ map
or reduce function. JNI is not used.
HADOOP DISTRIBUTED
FILESYSTEM (HDFS)
Filesystems that manage the storage across a network of machines are called
distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for
Hadoop Distributed Filesystem.
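As a taste of the client-side API (the file path is a hypothetical example), reading a file from HDFS looks much like ordinary Java I/O:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up fs.defaultFS etc.
        FileSystem fs = FileSystem.get(conf);         // handle to the cluster filesystem
        Path file = new Path("/user/demo/input.txt"); // hypothetical HDFS path
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // stream the file's contents to stdout
            }
        }
    }
}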
Goals of HDFS
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets.
They are not general purpose applications that typically run on general
purpose file systems. HDFS is designed more for batch processing rather
than interactive use by users. The emphasis is on high throughput of data
access rather than low latency of data access. POSIX imposes many hard
requirements that are not needed for applications that are targeted for
HDFS. POSIX semantics in a few key areas has been traded to increase
data throughput rates.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A
file once created, written, and closed need not be changed. This
assumption simplifies data coherency issues and enables high throughput
data access. A Map/Reduce application or a web crawler application fits
perfectly with this model. There is a plan to support appending-writes to files in the future.
Design of HDFS
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure. It is also worth examining the applications for which HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today: low-latency data access, large numbers of small files, and workloads with multiple writers or arbitrary file modifications.
Contact Us:
Our Address:
#444, 4th floor, Gumidelli Commercial Complex
Reliance Trends Building
Begumpet, Hyderabad
Phone:
USA : +1 732-419-2619
INDIA: +91 8121660044
Email:
[email protected]
Website: http://www.hadooponlinetutor.com