Module 3 - MapReduce
Part - 3
By: Dr. Rashmi L Malghan & Ms. Shavantrevva S Bilakeri
AGENDA
Hadoop Main Components:
Comparison: Conventional vs. MapReduce

Conventional Approach:
• Single-machine processing.
• Suitable for small datasets.
• Limited scalability.
• May become a bottleneck for big data.

MapReduce Approach:
• Distributed processing across a cluster.
• Scalable for large datasets.
• Handles parallel processing efficiently.
• Tackles the challenges of big data.
MapReduce:
MapReduce - Nutshell
Advantage 1: Parallel Processing
• In Hadoop, data gets divided into small chunks called HDFS blocks (a small local sketch of this idea follows below).
• Scalability
• Fault Tolerance
• Flexibility
• Efficient Resource Utilization
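To make the idea concrete, here is a minimal local sketch (plain Python, not Hadoop itself) of splitting a file into fixed-size chunks, processing the chunks in parallel, and merging the partial results; the file name, chunk size, and helper names are illustrative assumptions:

from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # each chunk is processed independently, like one map task on one HDFS block
    return Counter(chunk.split())

if __name__ == "__main__":
    text = open("word_count_data.txt").read()
    # split the data into fixed-size chunks (simplified stand-ins for HDFS blocks;
    # note this naive split can cut a word in two at a chunk boundary)
    size = 1024
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with Pool() as pool:
        # the chunks are counted in parallel, like map tasks on different nodes
        partial_counts = pool.map(count_words, chunks)
    # merge the per-chunk results, analogous to the reduce step
    total = sum(partial_counts, Counter())
    print(total.most_common(5))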
Advantage 2: Data Locality
• Processing – the master sends the required logic to the slaves that need it for their job processing.
• Smaller chunks of data get processed in parallel at multiple locations.
• This saves the "time" and "network bandwidth" required to move/transfer large amounts of data from one point/location to another.
• Results are sent back to the master machine and then on to the client machine.
• Data Locality – MapReduce processes data at the "location" where it is stored rather than bringing the data to a "centralised server".
• The client sends/submits data; the Resource Manager decides where the job runs – data usually resides on the "nearest Data Node", which reduces network bandwidth.
• Master = NameNode in HDFS [storage]; for processing, the master is the Resource Manager. (Block placement can be inspected directly, as shown below.)
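The block placement that data locality relies on can be inspected with the same fsck command used later in the exercise section; the path below assumes the word_count_data.txt file from the word-count example is already in HDFS:

hdfs fsck /word_count_in_python/word_count_data.txt -files -blocks -locations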
MapReduce: Phases and Daemons
MapReduce Framework
Phases:
• Map(): converts input into key-value pairs.
• Reduce(): combines the output of the mappers and produces a reduced result set (a local sketch of both phases follows after the daemons).

Daemons:
• JobTracker: master, schedules tasks.
• TaskTracker: slave, executes tasks.
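A minimal, purely local sketch of the two phases (illustrative Python, not the Hadoop framework API; the sample lines and helper names are assumptions):

from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map(): convert an input line into (key, value) pairs
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce(): combine the mappers' output into a reduced result
    return key, sum(values)

lines = ["deer bear river", "car car river", "deer car bear"]
pairs = [kv for line in lines for kv in map_phase(line)]
pairs.sort(key=itemgetter(0))                      # shuffle & sort by key
results = [reduce_phase(k, [v for _, v in group])
           for k, group in groupby(pairs, key=itemgetter(0))]
print(results)   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]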
JobTracker and TaskTracker Interaction
MapReduce Programming
[Figure: MapReduce programming workflow – the JobTracker receives a computing task and assigns map tasks to TaskTracker1 and TaskTracker2; each TaskTracker processes its local data block (Data Block1 / Data Block2), and the map outputs (Map Output1 / Map Output2) are combined into the final output.]
Phases in Map and Reduce Tasks
[Figure: internal phases of a map task (3: Combiner) and of a reduce task (3: Reducer).]
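A hedged sketch of what the combiner phase does: it locally aggregates one map task's output before it travels to the reducer (the data and function names here are illustrative, not Hadoop's API):

from collections import Counter

def combiner(map_output):
    # map_output: the (word, 1) pairs emitted by ONE map task;
    # summing them locally means less data is shuffled across the network
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())

map_output = [("car", 1), ("car", 1), ("river", 1), ("car", 1)]
print(combiner(map_output))   # [('car', 3), ('river', 1)]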
Step 1: Create a file with the name word_count_data.txt and add some data to it
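For example (any text will do; the contents below are just sample data):

echo "deer bear river car car river deer car bear" > word_count_data.txt
cat word_count_data.txt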
Step 2: Create mapper.py:

#!/usr/bin/env python
import sys

# read each line of input from STDIN (standard input)
for line in sys.stdin:
    # split the line into words
    words = line.strip().split()
    # we are looping over the words array and printing the word
    # with the count of 1 to STDOUT
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print('%s\t%s' % (word, 1))
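The mapper can be checked on its own before Hadoop is involved; the expected output is shown as comments:

echo "deer bear river deer" | python mapper.py
# deer    1
# bear    1
# river   1
# deer    1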
Step 3: Create reducer.py:

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # parse the tab-separated "word<TAB>count" pair produced by mapper.py
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Test the mapper and reducer locally before running them on Hadoop:
cat word_count_data.txt | python mapper.py | sort | python reducer.py
Step 4: Now let's start all our Hadoop daemons with the command:
start-all.sh
hdfs dfs -mkdir /word_count_in_python
hdfs dfs -copyFromLocal /home/ssb/Documents/word_count_data.txt /word_count_in_python
chmod 777 mapper.py reducer.py
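As an optional check, jps should list the running Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.), and -ls should show the file copied into HDFS:

jps
hdfs dfs -ls /word_count_in_python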
Step 5: Now download the latest hadoop-streaming jar file, then place this hadoop-streaming jar file somewhere you can easily access it.
Step 6: Run our Python files with the help of the Hadoop streaming utility, as shown below.
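A typical invocation looks like the following; the streaming jar path from Step 5 and the output directory name are placeholders/assumptions, while the input path matches the one created in Step 4:

hadoop jar /path/to/hadoop-streaming.jar \
  -input /word_count_in_python/word_count_data.txt \
  -output /word_count_out_python \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -file /home/ssb/Documents/mapper.py \
  -file /home/ssb/Documents/reducer.py

# once the job finishes, the word counts are in the output directory
hdfs dfs -cat /word_count_out_python/part-00000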
• The counting part for each respective booth was done by the map function and sent to the reducer.
• Combining of the results was done by the reduce function.
Anatomy of MapReduce:
[Figure: anatomy of a MapReduce job, showing the input and output of each stage.]
Exercise: MapReduce
3. Arrange the data on user ID, then within user ID sort them
in increasing order of page count
4. hadoop datanode
11. hdfs fsck /path/to/hdfs/directory -files -blocks -locations -racks -replicaDetails > fsck_report.txt
1. hadoop mradmin -refreshNodes -decommission <hostname>
• YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to remove
the bottleneck on Job Tracker which was present in Hadoop 1.0.
• YARN was described as a "Redesigned Resource Manager" at the time of its launch.
• YARN architecture basically separates resource management layer from the processing layer.
• YARN also allows different data processing engines like graph processing, interactive processing,
stream processing as well as batch processing.
• It can dynamically allocate various resources and schedule the application processing.
YARN Features
• Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
• Client
• Resource Manager
1. Scheduler
2. Application Manager
• Node Manager
• Application Master
• Container
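These components can be observed with the standard YARN command-line client (real YARN CLI commands; the actual output depends on the cluster):

yarn node -list           # Node Managers currently registered with the Resource Manager
yarn application -list    # applications the Resource Manager currently knows about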
Application workflow in Hadoop YARN
1. Client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. Application code is executed in the container.
7. Client contacts the Resource Manager / Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
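From the client's side, the same workflow can be exercised with the MapReduce examples jar that ships with Hadoop (the jar path/wildcard and output directory below are assumptions):

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /word_count_in_python /yarn_wordcount_out
yarn application -status <application_id>   # step 7: the client monitors the application's status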
Application : Workflow
YARN: Beyond MapReduce
• Using YARN, a lot of frameworks are able to utilize and connect to HDFS.
• YARN opened the gate for other frameworks, search engines, and even other big data applications.
• Use of HDFS increased because YARN acts as the resource provider.
Hadoop Daemons:
Hadoop 2.x MapReduce YARN Components:
YARN: Workflow