
MAPREDUCE AND HADOOP

Submitted By:
Varsha
00804092015
MCA 3rd year
Before MapReduce…

• Large-scale data processing was difficult! Challenges included:
  – Managing hundreds or thousands of processors
  – Managing parallelization and distribution
  – I/O scheduling
  – Status reporting and monitoring
  – Fault/crash tolerance
• MapReduce provides all of these, easily!
MapReduce Overview

• What is it?
  – A programming model used by Google
  – A combination of the Map and Reduce models with an associated implementation
  – Used for processing and generating large data sets
Map Abstraction

• Takes a key/value pair as input
  – Key: a reference to the input value
  – Value: the data set on which to operate
• Evaluation
  – A function defined by the user
  – Applied to every value in the input
    • May need to parse the input
• Produces a new list of key/value pairs
  – These can be of a different type from the input pair
Map Example
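
As a concrete sketch (the class name WordCountMapper is illustrative; the types come from the standard org.apache.hadoop.mapreduce Java API), a word-count mapper reads one line at a time and emits a (word, 1) pair for every token:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key = byte offset of the line in the file; value = one line of text
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE); // emit (word, 1) for every occurrence
    }
  }
}

Here the key is a reference to the input (the line's offset in the file) and the value is the data on which to operate (the line itself), matching the abstraction above.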
Reduce Abstraction
• Typically a function that:
  – Starts with a large number of key/value pairs
    • One key/value pair for each word in all the files being grouped (including multiple entries for the same word)
  – Ends with very few key/value pairs
    • One key/value pair for each unique word across all the files, with the number of instances summed into this entry
• The work is broken up so that a given worker handles input sharing the same key.
Reduce Example
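
The matching reducer in the same sketch (WordCountReducer is again an illustrative name) receives each word together with all the counts emitted for it, and sums them:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // values = every count emitted for this word across all mappers
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    result.set(sum);
    context.write(key, result); // emit (word, total occurrences)
  }
}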
How Map and Reduce Work Together

Map returns information; Reduce accepts that information and applies a user-defined function to reduce the amount of data.
Example: Word Count
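
To tie the two halves together, a minimal driver (a sketch; the class name WordCount and the input/output paths are placeholders supplied as arguments) configures and submits the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // optional local pre-aggregation
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

For an input line such as "the quick fox and the dog", the mappers emit (the,1), (quick,1), (fox,1), (and,1), (the,1), (dog,1); after grouping by key, the reducer produces (the,2), (quick,1), (fox,1), (and,1), (dog,1).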
Applications
• MapReduce is built on top of GFS, the Google File System. Input and output files are stored on GFS.
• While MapReduce is heavily used within Google, it has also found use in companies such as Yahoo, Facebook, and Amazon.
• The original implementation was done by Google. It is used internally for a large number of Google services.
• The Apache Hadoop project built a clone to the specifications published by Google. Amazon, in turn, runs Hadoop MapReduce on its EC2 (Elastic Compute Cloud) computing-on-demand service to offer the Amazon Elastic MapReduce service.
HADOOP
Hadoop
• Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
• It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
• Hadoop’s strength lies in its ability to scale across thousands of commodity servers that don’t share memory or disk space.
• Hadoop delegates tasks across these servers (called “worker nodes” or “slave nodes”), essentially harnessing the power of each device and running them together simultaneously.
• This is what allows massive amounts of data to be analyzed: splitting the tasks across different locations in this manner allows bigger jobs to be completed faster.
• Hadoop can be thought of as an ecosystem: it comprises many different components that all work together to create a single platform.
• There are two key functional components within this ecosystem:
 The storage of data (Hadoop Distributed File System, or HDFS)
 The framework for running parallel computations on this data (MapReduce)
Why Use Hadoop? Why Is It Important?
• Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
• Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
• Flexibility. Unlike traditional relational databases, you don’t have to pre-process data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images, and videos.
• Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
Hadoop Architecture

At its core, Hadoop has two major layers, namely:

(a) the processing/computation layer (MapReduce), and
(b) the storage layer (Hadoop Distributed File System).
MapReduce:
A parallel programming model for writing distributed applications, devised at Google for the efficient processing of large amounts of data on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework.

Hadoop Distributed File System (HDFS):

Based on the Google File System (GFS), HDFS provides a distributed file system designed to run on commodity hardware. It is fault-tolerant and designed to be deployed on low-cost hardware, and it offers high-throughput access to application data, making it suitable for applications with large datasets.
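
As a brief illustration of how an application touches HDFS, the sketch below reads a file through Hadoop's Java FileSystem API (the path /user/varsha/input.txt is a hypothetical example; the cluster configuration is picked up from core-site.xml and hdfs-site.xml on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // loads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);           // handle to the configured file system
    Path path = new Path("/user/varsha/input.txt"); // hypothetical HDFS path
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(path)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line); // print each line of the HDFS file
      }
    }
  }
}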
The Hadoop framework also includes the following two modules:

 Hadoop Common: the Java libraries and utilities required by other Hadoop modules.

 Hadoop YARN: a framework for job scheduling and cluster resource management.
Challenges and Limitations

• Security concerns: Hadoop's built-in security is limited and disabled by default, which is a risk when storing sensitive data.

• Vulnerable by nature: the framework is written almost entirely in Java, a platform that has been heavily targeted by attackers.

• Not fit for small data: HDFS is designed for large files and cannot efficiently support the random reading of many small files.

• Potential stability issues: as an evolving open-source project, some releases have had stability problems; running the latest stable version is advisable.

• General limitations: MapReduce is batch-oriented, so iterative or real-time workloads may be better served by other tools.
