Map Reduce Algorithm 1


The optimal value for M here is √N. Thus, with one stage of parallelization, we have reduced the complexity from O(N) to O(√N). Here it is assumed that all the parallel units are synchronized at both the input and the output, so that there is no latency; in such a case the running time is proportional to the complexity. The complexity can be reduced further by incorporating multiple intermediate stages of parallelization. For example, if we have two stages, with M mappers in the first stage and R in the next, the resulting complexity can be written as O(N/M + M/R + R).

Without any constraint on the number of intermediate stages, the optimal allocation is N/2 mappers in the first stage, N/4 in the second stage, and so on. This results in log2(N) stages, so the total time complexity is O(log2 N). Thus we have reduced the complexity from O(N) to O(log N). However, a practical issue is synchronization and latency: since the second stage can operate only when the first stage has completed its operations for all the mappers, delay accumulates at each level.
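To make the argument concrete, the following is a minimal worked derivation, assuming the simple cost model in which each of the M mappers processes N/M inputs and a single combining step then merges the M partial results; the model and the symbols T1, T2 are illustrative reconstructions rather than quotations from the cited sources.

\[
T_1(M) = \frac{N}{M} + M, \qquad \frac{dT_1}{dM} = -\frac{N}{M^2} + 1 = 0 \;\Rightarrow\; M = \sqrt{N}, \qquad T_1(\sqrt{N}) = 2\sqrt{N} = O(\sqrt{N}).
\]

\[
T_2(M, R) = \frac{N}{M} + \frac{M}{R} + R, \qquad T_{\text{tree}}(N) = \underbrace{O(1) + \dots + O(1)}_{\log_2 N \text{ stages}} = O(\log_2 N).
\]

With stages of N/2, N/4, ..., 1 mappers, each stage merges pairs of partial results in constant parallel time, which is what yields the O(log N) bound quoted above.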
3.4 Sorting Algorithm
Consider the sorting algorithm as follows. The input to the sort algorithm is a set of files, with each line consisting of a value. The mapper key is a tuple of file name and line number, and the mapper value is the text content of the line [5]. One important property of Hadoop (partitioning and shuffling) that can be leveraged for sorting is that the key-value pairs output by the mappers are sorted before being handed to the reducers. This is represented in figure 4, which has been reproduced from [5]. The mapper task keeps emitting the values associated with the keys, and the mapper output tuples are then forwarded to the reducers. Due to the above-mentioned property, the forwarding is arranged so that the smallest keys go to the smallest reducer index, and so on; the resulting combined output is automatically sorted [5].

Figure 4: Sorting with MapReduce [5]
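As an illustration of how the shuffle-phase sorting can be exploited, here is a minimal Python sketch that simulates the map, partition, sort, and reduce phases described above. The file contents, the range partitioner, and the function names are illustrative assumptions, not an actual Hadoop job.

# Mappers emit the parsed value as the output key; the framework sorts keys
# during the shuffle, and range partitioning sends the smallest keys to the
# smallest reducer index, so concatenating reducer outputs in index order
# yields a globally sorted list.

def map_phase(files):
    # files: dict of filename -> list of lines, each line holding one value
    for name, lines in files.items():
        for lineno, line in enumerate(lines):
            # mapper input key: (filename, line number); input value: line text
            yield int(line), (name, lineno)

def partition(key, num_reducers, key_range=100):
    # simple range partitioner: small keys to reducer 0, large keys to the last
    return min(key * num_reducers // key_range, num_reducers - 1)

def sort_with_mapreduce(files, num_reducers=2, key_range=100):
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_phase(files):
        buckets[partition(key, num_reducers, key_range)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])     # per-reducer sort done by the framework
    # identity reducers: concatenating the reducer outputs is already sorted
    return [key for bucket in buckets for key, _ in bucket]

if __name__ == "__main__":
    files = {"part-0": ["42", "7", "93"], "part-1": ["15", "68", "3"]}
    print(sort_with_mapreduce(files))         # [3, 7, 15, 42, 68, 93]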
3.5 Graph Algorithms
Firstly, for a graph problem to be considered for a MapReduce algorithm, the magnitude of the graph should be very large. Some of the fields that require analysis of such huge graphs are social networks, web pages, location maps, etc. A graph is generally represented in terms of nodes and edges (links) connecting the nodes. In the case of social media, the nodes may represent individuals and the edges represent the connections between them. The nodes and edges of a social network also carry other information with them: associated with each node (individual) are attributes such as age, sex, height and country, and associated with each edge is information such as the type of relationship and the strength of the relationship. If we take hyperlinks into consideration, the world wide web is full of web pages that are connected to one another. The web pages form the nodes of the graph, and the edge weights can represent the number of hops required to reach another web page from the current page [6]. If we consider Google Maps, which provides routes between arbitrary locations, that in itself is a very huge graph, with nodes representing locations. All the graphs discussed above are very large, and in order to analyze them and derive metrics from them, sequential processing will never work out given their magnitude. Hence MapReduce comes into the picture, because storing and processing such huge graphs is not manageable on a single system.
Breadth First Search Algorithm
In order to run graph algorithms with MapReduce, we need to find suitable graph representations to store in HDFS that facilitate efficient processing. Generally, these algorithms are iterative in nature. As far as breadth-first search is concerned, it moves through the graph level by level in each iteration, starting from a single node: in the first level it processes all nodes connected to the starting node, and in the second level all the nodes connected to those nodes are processed [6]. This does not suit the MapReduce programming model directly, and passing the whole graph to the mappers consumes a lot of memory and is wasteful. Hence it is essential to find an appropriate representation for the graph.

The usual representation of a graph as references from one node to another will not work, as it amounts to a linked-list implementation and cannot be serialized [5]. The adjacency-matrix representation is also unsuitable, because the graph is huge and the matrix consists mostly of zeros, so we would have to process unnecessarily long rows most of whose entries are zero. Now consider the sparse matrix representation, which stores, for every node, the list of its adjacent nodes along with their weights [5]. This representation is simple and can easily be passed as an argument to the mappers or reducers, since only the non-zero elements are part of the list. It thus eliminates the need to store huge matrices most of whose values are zero.
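A minimal sketch of how such a sparse, adjacency-list representation might be laid out as text records in HDFS, one line per node, so that each mapper receives only a node and its non-zero neighbours. The field layout and the helper names are illustrative assumptions rather than a prescribed format.

# One text record per node: node_id <TAB> distance <TAB> neighbour:weight,...
# Only non-zero entries are stored, so a huge, sparse graph stays compact and
# each record can be handed to a mapper independently.

def serialize(node_id, distance, adjacency):
    adj = ",".join(f"{m}:{w}" for m, w in adjacency)
    return f"{node_id}\t{distance}\t{adj}"

def deserialize(line):
    node_id, distance, adj = line.split("\t")
    adjacency = ([(m, int(w)) for m, w in (pair.split(":") for pair in adj.split(","))]
                 if adj else [])
    return node_id, float(distance), adjacency

if __name__ == "__main__":
    record = serialize("n1", 0, [("n2", 1), ("n5", 1)])
    print(record)                  # n1 <TAB> 0 <TAB> n2:1,n5:1
    print(deserialize(record))     # ('n1', 0.0, [('n2', 1), ('n5', 1)])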
Let us consider one of the famous graph problems: finding the shortest path from the source to all other nodes in the graph. This problem is classically solved using Dijkstra's algorithm, which is a sequential algorithm. The algorithm works as follows. Initially, the distance from the source to every node is initialized to infinity, except for the distance to the source itself, which is zero. The algorithm maintains a priority queue, and distances are computed starting with the node having the minimum distance; in the first iteration this is the source node itself [6]. The source node is removed from the queue, the distance to every node in its adjacency list is calculated, and the source node is marked as visited. In this manner the algorithm proceeds through each level of the graph. This is repeated until the priority queue becomes empty, at which point the shortest paths from the source to all the nodes have been computed [6].
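For contrast with the MapReduce formulation that follows, here is a compact single-machine version of the sequential algorithm just described, using Python's heapq module as the priority queue; the graph encoding is an illustrative choice.

import heapq

def dijkstra(graph, source):
    # graph: dict mapping node -> list of (neighbour, edge_weight)
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    queue = [(0, source)]              # priority queue ordered by tentative distance
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)              # mark the retrieved node as visited
        for neighbour, weight in graph[node]:
            if d + weight < dist[neighbour]:
                dist[neighbour] = d + weight
                heapq.heappush(queue, (dist[neighbour], neighbour))
    return dist

if __name__ == "__main__":
    graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 2)], "b": []}
    print(dijkstra(graph, "s"))        # {'s': 0, 'a': 1, 'b': 3}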
Now this algorithm can be implemented in MapReduce using the parallel breadth-first search algorithm. Let us assume that in this graph the distance between all connected nodes equals 1. Let n represent a node of the graph and N denote the details corresponding to that node, such as its distance from the source and its adjacency list. The initialization is the same as in Dijkstra's algorithm. All the nodes are given as inputs to the mappers, and for every node that a mapper processes, it emits a key-value pair for each node in its adjacency list [6]. The key is the adjacent node itself and the value is one added to the current distance of the processed node. This is because, if a node can be reached with a distance of x, then we can reach a node connected to it with a distance of x+1. The reducers then receive the key-value pairs output by the mappers, i.e. (node, distance_to_node) [6]. From these, each reducer needs to choose the shortest distance for its node and update it in the adjacency-list record. This process continues in the next iteration; with every iteration we obtain the shortest distance from the source to the nodes in the next level. So if the diameter of the graph, i.e. the maximum distance between any two nodes of the graph, is D, then we need to repeat this process for D-1 iterations to find the shortest distance from the source to all the nodes, assuming that the graph is fully connected [6].

One main difference between Dijkstra's algorithm and the MapReduce version is that the former gives the path from the source to each node, whereas the latter only gives the shortest distance. To overcome this, we also need to emit the path to the node in the mappers and update it [6].

If the edges have different weights, we need to alter the adjacency list to accommodate the link weights as well, and while computing the distance we add the link weight instead of adding one to the previously calculated distance.

The pseudo code for the map and reduce functions is as below.

Map(node n, node N)
    d ← N.Distance
    Emit(node n, N)
    for all node m in AdjacencyList(N) do
        Emit(node m, d + 1)

Reduce(node m, [d1, d2, ...])
    dmin ← ∞
    M ← null
    for all d in [d1, d2, ...] do
        if IsNode(d) then
            M ← d
        else if d < dmin then
            dmin ← d
    M.Distance ← dmin
    Emit(node m, node M)

Pseudo Code for Parallel Breadth First Search, reproduced from reference [6]
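The following is a minimal Python rendering of one iteration of the pseudo code above, assuming unit edge weights and an in-memory stand-in for the shuffle phase; the driver loop that repeats the iteration D-1 times, and the handling of nodes with no record of their own, are omitted. The function and field names are illustrative.

# Mapper: re-emit the node structure so its adjacency list is not lost, plus a
# tentative distance for every neighbour. Reducer: keep the minimum distance.

def bfs_map(node_id, node):
    # node: dict with "distance" and "adj" (list of neighbour ids)
    d = node["distance"]
    yield node_id, node                  # pass the graph structure along
    for m in node["adj"]:
        yield m, d + 1                   # unit edge weight assumed

def bfs_reduce(node_id, values):
    d_min, node = float("inf"), None
    for v in values:
        if isinstance(v, dict):          # the node structure itself
            node = v
        elif v < d_min:                  # a tentative distance
            d_min = v
    node["distance"] = min(node["distance"], d_min)
    return node_id, node

if __name__ == "__main__":
    graph = {
        "s": {"distance": 0, "adj": ["a", "b"]},
        "a": {"distance": float("inf"), "adj": ["b"]},
        "b": {"distance": float("inf"), "adj": []},
    }
    # shuffle: group the mapper output by key, then reduce each group
    grouped = {}
    for nid, node in graph.items():
        for key, value in bfs_map(nid, node):
            grouped.setdefault(key, []).append(value)
    graph = dict(bfs_reduce(nid, vals) for nid, vals in grouped.items())
    print({nid: n["distance"] for nid, n in graph.items()})   # {'s': 0, 'a': 1, 'b': 1}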
3.6 Machine Learning and MapReduce
Machine learning is an important application area for MapReduce, since most machine learning algorithms are data intensive. Some of the common algorithms in the field are Naïve Bayes classification and K-Means clustering.

Naïve Bayes Classification
Here the input to the algorithm is a huge data set that contains the values of multiple attributes and the corresponding class label. The algorithm needs to learn the correlation of the attributes with the class and exploit this information to classify test cases that contain the attribute information but no class label. A typical example of a training and a testing data set for the naive Bayes model is given in table 5 and table 6 respectively.

Table 5: Naive Bayes Training Data
Attr 1   Attr 2   Attr 3   Attr 4   Attr 5     Class
a        1        4        0        'High'     2
b        3        2        1        'Medium'   3
d        2        3        0        'High'     1
b        1        3        1        'Low'      2

Table 6: Naive Bayes Testing Data
Attr 1   Attr 2   Attr 3   Attr 4   Attr 5     Class
d        3        3        0        'High'     ?
b        1        4        1        'Medium'   ?

For example, for any attribute Ai, the algorithm needs to compute

Pr(Ai = Sm[i] | C = ck) = (count of rows with Ai = Sm[i] and C = ck) / (total number of rows with C = ck)

This needs to be repeated for all the states of Ai, for all the attributes, and for all states of the class variable. Let N be the number of attributes, C the number of classes, and S the number of states of a given attribute. Then the complexity of the above operation is O(CNS). However, these operations are highly amenable to parallel implementation. It has been shown in [7] that the complexity can be reduced roughly by a factor of P with a MapReduce framework, where P is the number of cores in the processor. The software implementation is as follows: divide the input data set into multiple subgroups, with each subgroup handling certain tuples of (Attribute, State, Class).
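A minimal Python sketch of this counting step in map/reduce style, assuming each training row is a dictionary keyed by attribute name plus a 'Class' field as in table 5; the shuffle and sum are collapsed into a local loop, and the function names are illustrative.

from collections import defaultdict

# Mapper: for every row, emit a count of 1 for each (attribute, state, class)
# tuple and for the class itself, so the ratio
#   Pr(Ai = s | C = c) = count(Ai = s, C = c) / count(C = c)
# can be formed afterwards.

def nb_map(row):
    c = row["Class"]
    yield ("__class__", c), 1
    for attr, state in row.items():
        if attr != "Class":
            yield (attr, state, c), 1

def nb_train(rows):
    counts = defaultdict(int)
    for row in rows:                       # shuffle + sum, collapsed for brevity
        for key, one in nb_map(row):
            counts[key] += one
    probs = {}
    for key, n in counts.items():
        if key[0] != "__class__":
            attr, state, c = key
            probs[(attr, state, c)] = n / counts[("__class__", c)]
    return probs

if __name__ == "__main__":
    rows = [
        {"Attr 1": "a", "Attr 5": "High", "Class": 2},
        {"Attr 1": "b", "Attr 5": "Medium", "Class": 3},
        {"Attr 1": "b", "Attr 5": "Low", "Class": 2},
    ]
    model = nb_train(rows)
    print(model[("Attr 1", "b", 2)])       # 0.5: one of the two class-2 rows has Attr 1 = b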
K-Means Clustering
This is another important clustering algorithm commonly used in machine learning. In K-means clustering [8], the following operations take place.

Input: random points in a multi-dimensional space
Initialization: generate K points in the input space that are to be used as cluster centroids
Iteration:
    Associate each of the points with the nearest cluster centroid
    Re-compute the cluster centroids
Output: a cluster index for each of the points

As can be seen from the above steps, the algorithm is amenable to the MapReduce architecture, where the cluster association and the centroid computation can be done in parallel. In [7] it has been shown that the complexity can be reduced by a factor of P, where P is the number of cores.
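One such iteration, expressed in the same map/reduce style as the earlier sketches and assuming points are plain coordinate tuples; initialization of the centroids and the convergence test are left to a driver, and the helper names are illustrative.

# Mapper: assign each point to its nearest centroid (cluster association).
# Reducer: average the points assigned to a centroid (centroid re-computation).

def kmeans_map(point, centroids):
    nearest = min(range(len(centroids)),
                  key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
    yield nearest, point

def kmeans_iteration(points, centroids):
    assigned = {}
    for point in points:                       # map + shuffle
        for k, p in kmeans_map(point, centroids):
            assigned.setdefault(k, []).append(p)
    new_centroids = list(centroids)            # clusters with no points keep their centroid
    for k, members in assigned.items():        # reduce: re-compute each centroid
        new_centroids[k] = tuple(sum(coords) / len(members) for coords in zip(*members))
    return new_centroids

if __name__ == "__main__":
    points = [(0.0, 0.0), (0.5, 0.0), (9.0, 9.0), (9.5, 8.5)]
    print(kmeans_iteration(points, [(0.0, 1.0), (8.0, 8.0)]))
    # [(0.25, 0.0), (9.25, 8.75)]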
4. CONCLUSION
Parallel computation has started dominating today's world. Given any problem, the best solution needs to achieve its results efficiently; solving the problem is no longer enough. In order to survive in this competitive world, everything should be performed in a time-efficient manner, and to achieve that we need to move toward parallel computing. Programmers can no longer be kept blind to the underlying hardware architecture. In order to scale up and
