Map Reduce Algorithm 1


The optimal value for M here is √N. Thus, with one stage of parallelization, we have reduced the complexity from O(N) to O(√N). Here it is assumed that all the parallel units are synchronized at both the input and the output, so that there is no latency; in such a case the running time is proportional to the complexity. The complexity can be reduced further by incorporating multiple intermediate stages of parallelization. For example, if we have two stages, with M mappers in the first stage and R in the next, the resulting complexity can be written as O(N/M + M/R + R).

Without any constraint on the number of intermediate stages, the optimal allocation is N/2 mappers in the first stage, N/4 in the second stage, and so on. This results in log2(N) stages, so the total time complexity is O(log2 N). Thus we have reduced the complexity from O(N) to O(log N). However, a practical issue is synchronization and latency: since the second stage can operate only when the first stage has completed its operations for all the mappers, delay accumulates at each level.
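To make the argument concrete, the following is a minimal worked derivation, assuming the simple cost model in which each of the M mappers processes N/M inputs and a single combining step then merges the M partial results; the model and the symbols T1, T2 are illustrative reconstructions rather than quotations from the cited sources.

\[
T_1(M) = \frac{N}{M} + M, \qquad \frac{dT_1}{dM} = -\frac{N}{M^2} + 1 = 0 \;\Rightarrow\; M = \sqrt{N}, \qquad T_1(\sqrt{N}) = 2\sqrt{N} = O(\sqrt{N}).
\]

\[
T_2(M, R) = \frac{N}{M} + \frac{M}{R} + R, \qquad T_{\text{tree}}(N) = \underbrace{O(1) + \dots + O(1)}_{\log_2 N \text{ stages}} = O(\log_2 N).
\]

With stages of N/2, N/4, ..., 1 mappers, each stage merges pairs of partial results in constant parallel time, which is what yields the O(log N) bound quoted above.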
3.4 Sorting Algorithm
Consider the sorting algorithm as follows. The input to the sort algorithm is a set of files, with each line consisting of a value. The mapper key is a tuple of file name and line number, and the mapper value is the text content of the line [5]. One important property of Hadoop (partitioning and shuffling) that can be leveraged for sorting is that the key-value pairs output by the mappers are sorted before being handed to the reducers. This is represented in figure 4, which has been reproduced from [5]. The mapper task keeps emitting the values associated with the keys, and the mapper output tuples are then forwarded to the reducers. Due to the above-mentioned property, the forwarding is arranged so that the smallest keys go to the smallest reducer index, and so on; the resulting combined output is automatically sorted [5].

Figure 4: Sorting with MapReduce [5]
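As an illustration of how the shuffle-phase sorting can be exploited, here is a minimal Python sketch that simulates the map, partition, sort, and reduce phases described above. The file contents, the range partitioner, and the function names are illustrative assumptions, not an actual Hadoop job.

# Mappers emit the parsed value as the output key; the framework sorts keys
# during the shuffle, and range partitioning sends the smallest keys to the
# smallest reducer index, so concatenating reducer outputs in index order
# yields a globally sorted list.

def map_phase(files):
    # files: dict of filename -> list of lines, each line holding one value
    for name, lines in files.items():
        for lineno, line in enumerate(lines):
            # mapper input key: (filename, line number); input value: line text
            yield int(line), (name, lineno)

def partition(key, num_reducers, key_range=100):
    # simple range partitioner: small keys to reducer 0, large keys to the last
    return min(key * num_reducers // key_range, num_reducers - 1)

def sort_with_mapreduce(files, num_reducers=2, key_range=100):
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_phase(files):
        buckets[partition(key, num_reducers, key_range)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])     # per-reducer sort done by the framework
    # identity reducers: concatenating the reducer outputs is already sorted
    return [key for bucket in buckets for key, _ in bucket]

if __name__ == "__main__":
    files = {"part-0": ["42", "7", "93"], "part-1": ["15", "68", "3"]}
    print(sort_with_mapreduce(files))         # [3, 7, 15, 42, 68, 93]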
3.5 Graph Algorithms
Firstly, for a graph problem to be considered for a MapReduce algorithm, the magnitude of the graph should be very large. Some of the fields that require analysis of such huge graphs are social networks, web pages, location maps, etc. A graph is generally represented in terms of nodes and edges (links) connecting the nodes. In the case of social media, the nodes may represent individuals and the edges represent the connections between them. The nodes and edges of a social network also carry other information with them: associated with each node (individual) are attributes such as age, sex, height and country, and associated with each edge is information such as the type of relationship and the strength of the relationship. If we take hyperlinks into consideration, the world wide web is full of web pages that are connected to one another. The web pages form the nodes of the graph, and the edge weights can represent the number of hops required to reach another web page from the current page [6]. If we consider Google Maps, which provides routes between arbitrary locations, that in itself is a very huge graph, with nodes representing locations. All the graphs discussed above are very large, and in order to analyze them and derive metrics from them, sequential processing will never work out given their magnitude. Hence MapReduce comes into the picture, because storing and processing such huge graphs is not manageable on a single system.
Breadth First Search Algorithm
In order to run graph algorithms with MapReduce, we need to find suitable graph representations to store in HDFS that facilitate efficient processing. Generally, these algorithms are iterative in nature. As far as breadth-first search is concerned, it moves through the graph level by level in each iteration, starting from a single node: in the first level it processes all nodes connected to the starting node, and in the second level all the nodes connected to those nodes are processed [6]. This does not suit the MapReduce programming model directly, and passing the whole graph to the mappers consumes a lot of memory and is wasteful. Hence it is essential to find an appropriate representation for the graph.

The usual representation of a graph as references from one node to another will not work, as it amounts to a linked-list implementation and cannot be serialized [5]. The adjacency-matrix representation is also unsuitable, because the graph is huge and the matrix consists mostly of zeros, so we would have to process unnecessarily long rows most of whose entries are zero. Now consider the sparse matrix representation, which stores, for every node, the list of its adjacent nodes along with their weights [5]. This representation is simple and can easily be passed as an argument to the mappers or reducers, since only the non-zero elements are part of the list. It thus eliminates the need to store huge matrices most of whose values are zero.
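A minimal sketch of how such a sparse, adjacency-list representation might be laid out as text records in HDFS, one line per node, so that each mapper receives only a node and its non-zero neighbours. The field layout and the helper names are illustrative assumptions rather than a prescribed format.

# One text record per node: node_id <TAB> distance <TAB> neighbour:weight,...
# Only non-zero entries are stored, so a huge, sparse graph stays compact and
# each record can be handed to a mapper independently.

def serialize(node_id, distance, adjacency):
    adj = ",".join(f"{m}:{w}" for m, w in adjacency)
    return f"{node_id}\t{distance}\t{adj}"

def deserialize(line):
    node_id, distance, adj = line.split("\t")
    adjacency = ([(m, int(w)) for m, w in (pair.split(":") for pair in adj.split(","))]
                 if adj else [])
    return node_id, float(distance), adjacency

if __name__ == "__main__":
    record = serialize("n1", 0, [("n2", 1), ("n5", 1)])
    print(record)                  # n1 <TAB> 0 <TAB> n2:1,n5:1
    print(deserialize(record))     # ('n1', 0.0, [('n2', 1), ('n5', 1)])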
Let us consider one of the famous graph problems: finding the shortest path from the source to all other nodes in the graph. This problem is classically solved using Dijkstra's algorithm, which is a sequential algorithm. The algorithm works as follows. Initially, the distance from the source to every node is initialized to infinity, except for the distance to the source itself, which is zero. The algorithm maintains a priority queue, and distances are computed starting with the node having the minimum distance; in the first iteration this is the source node itself [6]. The source node is removed from the queue, the distance to every node in its adjacency list is calculated, and the source node is marked as visited. In this manner the algorithm proceeds through each level of the graph. This is repeated until the priority queue becomes empty, at which point the shortest paths from the source to all the nodes have been computed [6].
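For contrast with the MapReduce formulation that follows, here is a compact single-machine version of the sequential algorithm just described, using Python's heapq module as the priority queue; the graph encoding is an illustrative choice.

import heapq

def dijkstra(graph, source):
    # graph: dict mapping node -> list of (neighbour, edge_weight)
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    queue = [(0, source)]              # priority queue ordered by tentative distance
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)              # mark the retrieved node as visited
        for neighbour, weight in graph[node]:
            if d + weight < dist[neighbour]:
                dist[neighbour] = d + weight
                heapq.heappush(queue, (dist[neighbour], neighbour))
    return dist

if __name__ == "__main__":
    graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 2)], "b": []}
    print(dijkstra(graph, "s"))        # {'s': 0, 'a': 1, 'b': 3}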
Now this algorithm can be implemented in MapReduce using the parallel breadth-first search algorithm. Let us assume that in this graph the distance between all connected nodes equals 1. Let n represent a node of the graph and N denote the details corresponding to that node, such as its distance from the source and its adjacency list. The initialization is the same as in Dijkstra's algorithm. All the nodes are given as inputs to the mappers, and for every node that a mapper processes, it emits a key-value pair for each node in its adjacency list [6]. The key is the adjacent node itself and the value is one added to the current distance of the processed node. This is because, if a node can be reached with a distance of x, then we can reach a node connected to it with a distance of x+1. The reducers then receive the key-value pairs output by the mappers, i.e. (node, distance_to_node) [6]. From these, each reducer needs to choose the shortest distance for its node and update it in the adjacency-list record. This process continues in the next iteration; with every iteration we obtain the shortest distance from the source to the nodes in the next level. So if the diameter of the graph, i.e. the maximum distance between any two nodes of the graph, is D, then we need to repeat this process for D-1 iterations to find the shortest distance from the source to all the nodes, assuming that the graph is fully connected [6].

One main difference between Dijkstra's algorithm and the MapReduce version is that the former gives the path from the source to each node, whereas the latter only gives the shortest distance. To overcome this, we also need to emit the path to the node in the mappers and update it [6].

If the edges have different weights, we need to alter the adjacency list to accommodate the link weights as well, and while computing the distance we add the link weight instead of adding one to the previously calculated distance.

The pseudo code for the map and reduce functions is as below.

Map(node n, node N)
    d ← N.Distance
    Emit(node n, N)
    for all node m in AdjacencyList(N) do
        Emit(node m, d + 1)

Reduce(node m, [d1, d2, ...])
    dmin ← ∞
    M ← null
    for all d in [d1, d2, ...] do
        if IsNode(d) then
            M ← d
        else if d < dmin then
            dmin ← d
    M.Distance ← dmin
    Emit(node m, node M)

Pseudo Code for Parallel Breadth First Search, reproduced from reference [6]
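The following is a minimal Python rendering of one iteration of the pseudo code above, assuming unit edge weights and an in-memory stand-in for the shuffle phase; the driver loop that repeats the iteration D-1 times, and the handling of nodes with no record of their own, are omitted. The function and field names are illustrative.

# Mapper: re-emit the node structure so its adjacency list is not lost, plus a
# tentative distance for every neighbour. Reducer: keep the minimum distance.

def bfs_map(node_id, node):
    # node: dict with "distance" and "adj" (list of neighbour ids)
    d = node["distance"]
    yield node_id, node                  # pass the graph structure along
    for m in node["adj"]:
        yield m, d + 1                   # unit edge weight assumed

def bfs_reduce(node_id, values):
    d_min, node = float("inf"), None
    for v in values:
        if isinstance(v, dict):          # the node structure itself
            node = v
        elif v < d_min:                  # a tentative distance
            d_min = v
    node["distance"] = min(node["distance"], d_min)
    return node_id, node

if __name__ == "__main__":
    graph = {
        "s": {"distance": 0, "adj": ["a", "b"]},
        "a": {"distance": float("inf"), "adj": ["b"]},
        "b": {"distance": float("inf"), "adj": []},
    }
    # shuffle: group the mapper output by key, then reduce each group
    grouped = {}
    for nid, node in graph.items():
        for key, value in bfs_map(nid, node):
            grouped.setdefault(key, []).append(value)
    graph = dict(bfs_reduce(nid, vals) for nid, vals in grouped.items())
    print({nid: n["distance"] for nid, n in graph.items()})   # {'s': 0, 'a': 1, 'b': 1}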
3.6 Machine Learning and MapReduce
Machine learning is an important application area for MapReduce, since most machine learning algorithms are data intensive. Some of the common algorithms in the field are Naïve Bayes classification and K-Means clustering.

Naïve Bayes Classification
Here the input to the algorithm is a huge data set that contains the values of multiple attributes and the corresponding class label. The algorithm needs to learn the correlation of the attributes with the class and exploit this information to classify test cases that contain the attribute information but no class label. A typical example of a training and a testing data set for the naive Bayes model is given in table 5 and table 6 respectively.

Table 5: Naive Bayes Training Data
Attr 1   Attr 2   Attr 3   Attr 4   Attr 5     Class
a        1        4        0        'High'     2
b        3        2        1        'Medium'   3
d        2        3        0        'High'     1
b        1        3        1        'Low'      2

Table 6: Naive Bayes Testing Data
Attr 1   Attr 2   Attr 3   Attr 4   Attr 5     Class
d        3        3        0        'High'     ?
b        1        4        1        'Medium'   ?

For example, for any attribute Ai, the algorithm needs to compute

Pr(Ai = Sm[i] | C = ck) = (count of rows with Ai = Sm[i] and C = ck) / (total number of rows with C = ck)

This needs to be repeated for all the states of Ai, for all the attributes, and for all states of the class variable. Let N be the number of attributes, C the number of classes, and S the number of states of a given attribute. Then the complexity of the above operation is O(CNS). However, these operations are highly amenable to parallel implementation. It has been shown in [7] that the complexity can be reduced roughly by a factor of P with a MapReduce framework, where P is the number of cores in the processor. The software implementation is as follows: divide the input data set into multiple subgroups, with each subgroup handling certain tuples of (Attribute, State, Class).
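A minimal Python sketch of this counting step in map/reduce style, assuming each training row is a dictionary keyed by attribute name plus a 'Class' field as in table 5; the shuffle and sum are collapsed into a local loop, and the function names are illustrative.

from collections import defaultdict

# Mapper: for every row, emit a count of 1 for each (attribute, state, class)
# tuple and for the class itself, so the ratio
#   Pr(Ai = s | C = c) = count(Ai = s, C = c) / count(C = c)
# can be formed afterwards.

def nb_map(row):
    c = row["Class"]
    yield ("__class__", c), 1
    for attr, state in row.items():
        if attr != "Class":
            yield (attr, state, c), 1

def nb_train(rows):
    counts = defaultdict(int)
    for row in rows:                       # shuffle + sum, collapsed for brevity
        for key, one in nb_map(row):
            counts[key] += one
    probs = {}
    for key, n in counts.items():
        if key[0] != "__class__":
            attr, state, c = key
            probs[(attr, state, c)] = n / counts[("__class__", c)]
    return probs

if __name__ == "__main__":
    rows = [
        {"Attr 1": "a", "Attr 5": "High", "Class": 2},
        {"Attr 1": "b", "Attr 5": "Medium", "Class": 3},
        {"Attr 1": "b", "Attr 5": "Low", "Class": 2},
    ]
    model = nb_train(rows)
    print(model[("Attr 1", "b", 2)])       # 0.5: one of the two class-2 rows has Attr 1 = b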
K-Means Clustering
This is another important clustering algorithm commonly used in machine learning. In K-means clustering [8], the following operations take place.

Input: random points in a multi-dimensional space
Initialization: generate K points in the input space that are to be used as cluster centroids
Iteration:
    Associate each of the points with the nearest cluster centroid
    Re-compute the cluster centroids
Output: a cluster index for each of the points

As can be seen from the above steps, the algorithm is amenable to the MapReduce architecture, where the cluster association and the centroid computation can be done in parallel. In [7] it has been shown that the complexity can be reduced by a factor of P, where P is the number of cores.
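One such iteration, expressed in the same map/reduce style as the earlier sketches and assuming points are plain coordinate tuples; initialization of the centroids and the convergence test are left to a driver, and the helper names are illustrative.

# Mapper: assign each point to its nearest centroid (cluster association).
# Reducer: average the points assigned to a centroid (centroid re-computation).

def kmeans_map(point, centroids):
    nearest = min(range(len(centroids)),
                  key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
    yield nearest, point

def kmeans_iteration(points, centroids):
    assigned = {}
    for point in points:                       # map + shuffle
        for k, p in kmeans_map(point, centroids):
            assigned.setdefault(k, []).append(p)
    new_centroids = list(centroids)            # clusters with no points keep their centroid
    for k, members in assigned.items():        # reduce: re-compute each centroid
        new_centroids[k] = tuple(sum(coords) / len(members) for coords in zip(*members))
    return new_centroids

if __name__ == "__main__":
    points = [(0.0, 0.0), (0.5, 0.0), (9.0, 9.0), (9.5, 8.5)]
    print(kmeans_iteration(points, [(0.0, 1.0), (8.0, 8.0)]))
    # [(0.25, 0.0), (9.25, 8.75)]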
4. CONCLUSION
Parallel computation has started dominating today's world. Given any problem, the best solution needs to achieve its results efficiently; solving the problem is no longer enough. In order to survive in this competitive world, everything should be performed in a time-efficient manner, and to achieve that we need to move toward parallel computing. Programmers can no longer be kept blind to the underlying hardware architecture. In order to scale up and
