Module 3 - MapReduce
Part - 3
By: Dr. Rashmi L Malghan & Ms. Shavantrevva S Bilakeri
AGENDA
Hadoop Main Components:
Comparison: Conventional vs. MapReduce

Conventional Approach:
• Single-machine processing.
• Suitable for small datasets.
• Limited scalability.
• May become a bottleneck for big data.

MapReduce Approach:
• Distributed processing across a cluster.
• Scalable for large datasets.
• Handles parallel processing efficiently.
• Tackles the challenges of big data.
MapReduce:
MapReduce - Nutshell
Advantage 1: Parallel Processing
• In Hadoop, data gets divided into small chunks called HDFS blocks (a small local sketch of this idea follows below).
• Scalability
• Fault Tolerance
• Flexibility
• Efficient Resource Utilization
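To make the idea concrete, here is a minimal local sketch (plain Python, not Hadoop itself) of splitting a file into fixed-size chunks, processing the chunks in parallel, and merging the partial results; the file name, chunk size, and helper names are illustrative assumptions:

from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # each chunk is processed independently, like one map task on one HDFS block
    return Counter(chunk.split())

if __name__ == "__main__":
    text = open("word_count_data.txt").read()
    # split the data into fixed-size chunks (simplified stand-ins for HDFS blocks;
    # note this naive split can cut a word in two at a chunk boundary)
    size = 1024
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with Pool() as pool:
        # the chunks are counted in parallel, like map tasks on different nodes
        partial_counts = pool.map(count_words, chunks)
    # merge the per-chunk results, analogous to the reduce step
    total = sum(partial_counts, Counter())
    print(total.most_common(5))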
Advantage 2: Data Locality
• Processing – the master sends the required logic to the slaves that need it for their job processing.
• Smaller chunks of data get processed in parallel at multiple locations.
• This saves the "time" and "network bandwidth" required to move/transfer large amounts of data from one point/location to another.
• Results are sent back to the master machine and then on to the client machine.
• Data Locality – MapReduce processes data at the "location" where it is stored rather than bringing the data to a "centralised server".
• The client sends/submits data; the Resource Manager decides where the job runs – data usually resides on the "nearest Data Node", which reduces network bandwidth.
• Master = NameNode in HDFS [storage]; for processing, the master is the Resource Manager. (Block placement can be inspected directly, as shown below.)
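The block placement that data locality relies on can be inspected with the same fsck command used later in the exercise section; the path below assumes the word_count_data.txt file from the word-count example is already in HDFS:

hdfs fsck /word_count_in_python/word_count_data.txt -files -blocks -locations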
MapReduce: Phases and Daemons
MapReduce Framework
Phases:
• Map(): converts input into key-value pairs.
• Reduce(): combines the output of the mappers and produces a reduced result set (a local sketch of both phases follows after the daemons).

Daemons:
• JobTracker: master, schedules tasks.
• TaskTracker: slave, executes tasks.
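A minimal, purely local sketch of the two phases (illustrative Python, not the Hadoop framework API; the sample lines and helper names are assumptions):

from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map(): convert an input line into (key, value) pairs
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce(): combine the mappers' output into a reduced result
    return key, sum(values)

lines = ["deer bear river", "car car river", "deer car bear"]
pairs = [kv for line in lines for kv in map_phase(line)]
pairs.sort(key=itemgetter(0))                      # shuffle & sort by key
results = [reduce_phase(k, [v for _, v in group])
           for k, group in groupby(pairs, key=itemgetter(0))]
print(results)   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]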
JobTracker and TaskTracker Interaction
MapReduce Programming
[Figure: MapReduce programming workflow – the JobTracker receives a computing task and assigns map tasks to TaskTracker1 and TaskTracker2; each TaskTracker processes its local data block (Data Block1 / Data Block2), and the map outputs (Map Output1 / Map Output2) are combined into the final output.]
Phases in Map and Reduce Tasks
[Figure: internal phases of a map task (3: Combiner) and of a reduce task (3: Reducer).]
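A hedged sketch of what the combiner phase does: it locally aggregates one map task's output before it travels to the reducer (the data and function names here are illustrative, not Hadoop's API):

from collections import Counter

def combiner(map_output):
    # map_output: the (word, 1) pairs emitted by ONE map task;
    # summing them locally means less data is shuffled across the network
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())

map_output = [("car", 1), ("car", 1), ("river", 1), ("car", 1)]
print(combiner(map_output))   # [('car', 3), ('river', 1)]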
Step 1: Create a file with the name word_count_data.txt and add some data to it
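For example (any text will do; the contents below are just sample data):

echo "deer bear river car car river deer car bear" > word_count_data.txt
cat word_count_data.txt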
Step 2: Create mapper.py:

#!/usr/bin/env python
import sys

# read each line of input from STDIN (standard input)
for line in sys.stdin:
    # split the line into words
    words = line.strip().split()
    # we are looping over the words array and printing the word
    # with the count of 1 to STDOUT
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print('%s\t%s' % (word, 1))
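The mapper can be checked on its own before Hadoop is involved; the expected output is shown as comments:

echo "deer bear river deer" | python mapper.py
# deer    1
# bear    1
# river   1
# deer    1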
Step 3: Create reducer.py:

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # parse the tab-separated "word<TAB>count" pair produced by mapper.py
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Test the mapper and reducer locally before running them on Hadoop:
cat word_count_data.txt | python mapper.py | sort | python reducer.py
Step 4: Now let's start all our Hadoop daemons with the command:
start-all.sh
hdfs dfs -mkdir /word_count_in_python
hdfs dfs -copyFromLocal /home/ssb/Documents/word_count_data.txt /word_count_in_python
chmod 777 mapper.py reducer.py
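As an optional check, jps should list the running Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.), and -ls should show the file copied into HDFS:

jps
hdfs dfs -ls /word_count_in_python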
Step 5: Now download the latest hadoop-streaming jar file, then place this hadoop-streaming jar file somewhere you can easily access it.
Step 6: Run our Python files with the help of the Hadoop streaming utility, as shown below.
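A typical invocation looks like the following; the streaming jar path from Step 5 and the output directory name are placeholders/assumptions, while the input path matches the one created in Step 4:

hadoop jar /path/to/hadoop-streaming.jar \
  -input /word_count_in_python/word_count_data.txt \
  -output /word_count_out_python \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -file /home/ssb/Documents/mapper.py \
  -file /home/ssb/Documents/reducer.py

# once the job finishes, the word counts are in the output directory
hdfs dfs -cat /word_count_out_python/part-00000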
• The counting part for each respective booth was done by the map function and sent to the reducer.
• Combining of the results was done by the reduce function.
Anatomy of MapReduce:
[Figure: anatomy of a MapReduce job, showing the input and output of each stage.]
Exercise: MapReduce
3. Arrange the data on user ID, then within user ID sort them
in increasing order of page count
4. hadoop datanode
11. hdfs fsck /path/to/hdfs/directory -files -blocks -locations -racks -replicaDetails > fsck_report.txt
1. hadoop mradmin -refreshNodes -decommission <hostname>
• YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to remove
the bottleneck on Job Tracker which was present in Hadoop 1.0.
• YARN was described as a "Redesigned Resource Manager" at the time of its launch.
• YARN architecture basically separates resource management layer from the processing layer.
• YARN also allows different data processing engines like graph processing, interactive processing,
stream processing as well as batch processing.
• It can dynamically allocate various resources and schedule the application processing.
YARN Features
• Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
• Client
• Resource Manager
1. Scheduler
2. Application Manager
• Node Manager
• Application Master
• Container
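These components can be observed with the standard YARN command-line client (real YARN CLI commands; the actual output depends on the cluster):

yarn node -list           # Node Managers currently registered with the Resource Manager
yarn application -list    # applications the Resource Manager currently knows about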
Application workflow in Hadoop YARN
1. Client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. Application code is executed in the container.
7. Client contacts the Resource Manager / Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
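From the client's side, the same workflow can be exercised with the MapReduce examples jar that ships with Hadoop (the jar path/wildcard and output directory below are assumptions):

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /word_count_in_python /yarn_wordcount_out
yarn application -status <application_id>   # step 7: the client monitors the application's status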
Application : Workflow
YARN: Beyond MapReduce
• Using YARN, a lot of frameworks are able to utilize and connect to HDFS.
• YARN opened the gate for other frameworks, search engines, and even other big data applications.
• Use of HDFS increased because YARN acts as the resource provider.
Hadoop Daemons:
Hadoop 2.x MapReduce YARN Components:
YARN: Workflow