Hadoop and MapReduce
Submitted By:
Varsha
00804092015
MCA 3rd year
MapReduce
• What is it?
• Programming model used by Google
• A combination of the Map and Reduce models with an
associated implementation
• Used for processing and generating large data sets
Map and Reduce Abstractions
• Map returns information.
• Reduce accepts information and applies a user-defined
function to reduce the amount of data.
Example: Word Count
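The classic illustration is counting how often each word occurs across many input files: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums the counts per word. Below is a minimal sketch using the Hadoop MapReduce Java API, closely following the standard tutorial job; class names and the input/output paths taken from the command line are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in this node's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}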
Applications
• MapReduce is built on top of GFS, the Google File System.
Input and output files are stored on GFS.
• While MapReduce is heavily used within Google, it has
also found use at companies such as Yahoo!, Facebook, and
Amazon.
• The original implementation was done by Google. It is used
internally for a large number of Google services.
• The Apache Hadoop project built an open-source
implementation to the specifications Google published.
Amazon, in turn, runs Hadoop MapReduce on its EC2 (Elastic
Compute Cloud) computing-on-demand service to offer the
Amazon Elastic MapReduce service.
Hadoop
• Hadoop is an open-source software framework
for storing data and running applications on
clusters of commodity hardware.
• It provides massive storage for any kind of data,
enormous processing power and the ability to
handle virtually limitless concurrent tasks or
jobs.
• Hadoop’s strength lies in its ability to scale across
thousands of commodity servers that don’t share
memory or disk space.
• Hadoop delegates tasks across these servers (called
“worker nodes” or “slave nodes”), harnessing the power
of each machine and running them in parallel.
• This is what allows massive amounts of data to
be analyzed: splitting the tasks across different
locations in this manner allows bigger jobs to be
completed faster.
• Hadoop can be thought of as an ecosystem: it is
composed of many different components that all work
together to create a single platform.
• There are two key functional components within
this ecosystem:
The storage of data (Hadoop Distributed File
System, or HDFS)
The framework for running parallel
computations on this data (MapReduce).
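To make the storage side concrete, here is a minimal sketch of writing and reading a file through the HDFS Java FileSystem API; the cluster URI and file path below are illustrative assumptions, not values from this deck (in practice the URI comes from core-site.xml).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative cluster address; normally read from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write a small file; HDFS splits it into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}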
Why Use Hadoop? Why Is It Important?
• Ability to store and process huge amounts of any
kind of data, quickly. With data volumes and
varieties constantly increasing, especially from
social media and the Internet of Things (IoT), that's
a key consideration.
• Computing power. Hadoop's distributed
computing model processes big data fast. The more
computing nodes you use, the more processing
power you have.
• Fault tolerance. Data and application processing
are protected against hardware failure. If a node
goes down, jobs are automatically redirected to
other nodes to make sure the distributed computing
does not fail. Multiple copies of all data are stored
automatically.
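As a concrete sketch of how this replication is controlled: dfs.replication is the real HDFS property for the number of copies kept of each block (3 by default), while the file path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep 3 copies of every block, so losing one node loses no data.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed per file after the fact.
    fs.setReplication(new Path("/user/demo/hello.txt"), (short) 5); // hypothetical path
  }
}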
• Flexibility. Unlike traditional relational databases, you
don’t have to pre-process data before storing it. You can
store as much data as you want and decide how to use it
later. That includes unstructured data like text, images and
videos.
• Low cost. The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle
more data simply by adding nodes. Little administration
is required.
Hadoop Architecture
Vulnerable by Nature
General Limit