Hadoop and MapReduce
Submitted By:
Varsha
00804092015
MCA 3rd year
MapReduce
• What is it?
• Programming model used by Google
• A combination of the Map and Reduce models with an
associated implementation
• Used for processing and generating large data sets
Map and Reduce Abstractions
• Map returns information.
• Reduce accepts information and applies a user-defined
function to reduce the amount of data.
Example: Word Count
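The classic illustration is counting how often each word occurs across many input files: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums the counts per word. Below is a minimal sketch using the Hadoop MapReduce Java API, closely following the standard tutorial job; class names and the input/output paths taken from the command line are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in this node's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}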
Applications
• MapReduce is built on top of GFS, the Google File System.
Input and output files are stored on GFS.
• While MapReduce is heavily used within Google, it has
also found use at companies such as Yahoo!, Facebook, and
Amazon.
• The original implementation was done by Google. It is used
internally for a large number of Google services.
• The Apache Hadoop project built an open-source
implementation to the specifications Google published.
Amazon, in turn, runs Hadoop MapReduce on its EC2 (Elastic
Compute Cloud) computing-on-demand service to offer the
Amazon Elastic MapReduce service.
Hadoop
• Hadoop is an open-source software framework
for storing data and running applications on
clusters of commodity hardware.
• It provides massive storage for any kind of data,
enormous processing power and the ability to
handle virtually limitless concurrent tasks or
jobs.
• Hadoop’s strength lies in its ability to scale across
thousands of commodity servers that don’t share
memory or disk space.
• Hadoop delegates tasks across these servers (called
“worker nodes” or “slave nodes”), harnessing the power
of each machine and running them in parallel.
• This is what allows massive amounts of data to
be analyzed: splitting the tasks across different
locations in this manner allows bigger jobs to be
completed faster.
• Hadoop can be thought of as an ecosystem: it is
composed of many different components that all work
together to create a single platform.
• There are two key functional components within
this ecosystem:
The storage of data (Hadoop Distributed File
System, or HDFS)
The framework for running parallel
computations on this data (MapReduce).
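To make the storage side concrete, here is a minimal sketch of writing and reading a file through the HDFS Java FileSystem API; the cluster URI and file path below are illustrative assumptions, not values from this deck (in practice the URI comes from core-site.xml).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative cluster address; normally read from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write a small file; HDFS splits it into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}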
Why Use Hadoop? Why Is It Important?
• Ability to store and process huge amounts of any
kind of data, quickly. With data volumes and
varieties constantly increasing, especially from
social media and the Internet of Things (IoT), that's
a key consideration.
• Computing power. Hadoop's distributed
computing model processes big data fast. The more
computing nodes you use, the more processing
power you have.
• Fault tolerance. Data and application processing
are protected against hardware failure. If a node
goes down, jobs are automatically redirected to
other nodes to make sure the distributed computing
does not fail. Multiple copies of all data are stored
automatically.
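As a concrete sketch of how this replication is controlled: dfs.replication is the real HDFS property for the number of copies kept of each block (3 by default), while the file path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep 3 copies of every block, so losing one node loses no data.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed per file after the fact.
    fs.setReplication(new Path("/user/demo/hello.txt"), (short) 5); // hypothetical path
  }
}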
• Flexibility. Unlike traditional relational databases, you
don’t have to pre-process data before storing it. You can
store as much data as you want and decide how to use it
later. That includes unstructured data like text, images and
videos.
• Low cost. The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle
more data simply by adding nodes. Little administration
is required.
Hadoop Architecture
Vulnerable by Nature
General Limit