Stateful MapReduce
Ahmed Elgohary
Electrical and Computer Engineering Department
University of Waterloo
200 University Avenue West, Waterloo, ON, Canada
ABSTRACT
Hadoop is considered the cornerstone of today's cloud analytics, and much work is being carried out towards developing and enhancing its capabilities. However, an opposite research direction has started to emerge, in which researchers argue that Hadoop is not suitable for some applications and that new frameworks therefore need to be developed. Examples of such applications are graph analytics, online incremental processing, and iterative algorithms.
In this paper, we envision that by adding and maintaining states across multiple Hadoop jobs, a wide range of these applications will fit into Hadoop, eliminating the need to develop new frameworks. We present a Stateful MapReduce API together with an efficient design and implementation that extend Hadoop. Our experimental evaluation demonstrates the effectiveness of the proposed extensions.
Categories and Subject Descriptors
D.3 [Programming Techniques]: Concurrent Programming, Distributed Programming

General Terms
Design

Keywords
Distributed Computing, Cloud Computing, MapReduce, Hadoop

1. INTRODUCTION
Hadoop [7] is the most commonly used MapReduce implementation, and much work is being carried out to develop and improve its capabilities. For example, the authors in [2] presented policies for grouping and scheduling multiple MapReduce jobs in order to improve the overall system throughput. In [12], opportunities for sharing portions of the work carried out by multiple MapReduce jobs were identified, and an analytical model for grouping jobs together was developed accordingly. Another interesting direction for Hadoop development was presented in [8], where the authors considered the problem of automatically tuning Hadoop parameters based on the expected behaviour of the submitted jobs. Also, [1] presented a hybrid model that combines Hadoop with relational databases in order to enhance the system's performance.
Recently, researchers have started to argue that Hadoop (or the MapReduce framework in general) is not suitable for some applications, so new frameworks need to be developed to suit them. For example, the authors in [11] stated that graph algorithms do not fit into MapReduce, so they built a totally new framework (Pregel) designed specifically for graph processing. In [10], a new architecture for stateful bulk processing of dataflow programs was presented. In [5], the authors were concerned about using Hadoop for online analytics due to the latency introduced by materializing the intermediate data; hence, they proposed a modified MapReduce architecture that allows data to be pipelined between operators. For iterative processing using Hadoop, the authors in [4] developed HaLoop, in which loop-invariant data are cached locally at the worker machines.
It can be noticed from the paragraph above that we will eventually end up with several frameworks (totally new frameworks or different variations of Hadoop). In this paper, we envision instead that by adding and maintaining states across multiple Hadoop jobs, a wide range of these applications will fit into Hadoop without the need for new frameworks.
2. STATEFUL MAPREDUCE API
We modified the MapReduce API to provide users with access to task states. Users can store and retrieve key-value pairs to and from the state of each task (Mapper or Reducer). The stateful Mapper and Reducer are defined as:

map(keyIn, valIn, state): <keyOut, valOut>
reduce(keyIn, List<valIn>, state): <keyOut, valOut>

Users can access the state as follows:

int count = state.get("count")
state.set("count", count + 1)

Users also need to specify which tasks should be stateful and which should be stateless. The API is flexible enough that users can combine stateful and stateless tasks in the same job.
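To make the pattern above concrete, the following is a minimal Java sketch of a stateful map function. The State interface shown here is an assumption: the paper only specifies its get/set usage, so the exact types and class names are illustrative.

import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Assumed State interface; only get/set usage appears in the API above.
interface State {
    Integer get(String key);          // null when no value is stored for the key
    void set(String key, int value);
}

public class StatefulMapExample {
    // map(keyIn, valIn, state): <keyOut, valOut>
    public static Map.Entry<Long, String> map(long offset, String line, State state) {
        // Carry a record counter across jobs instead of recomputing it each time.
        Integer stored = state.get("count");
        int count = (stored == null) ? 0 : stored;
        state.set("count", count + 1);
        return new SimpleEntry<>(offset, line);   // pass the record through unchanged
    }
}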
3. EXAMPLES
3.1 Sessionization
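As described in the evaluation in Section 5, stateful reducers can maintain, for each user, the running total of requested objects, so each job only needs to process the newly arriving logs. Below is a hedged sketch of such a reducer, reusing the State interface assumed in the sketch in Section 2.

import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

interface State { Integer get(String key); void set(String key, int value); } // assumed

public class SessionizationReducer {
    // reduce(keyIn, List<valIn>, state): <keyOut, valOut>
    public static Map.Entry<String, Integer> reduce(String userId,
                                                    Iterable<Integer> requests, State state) {
        Integer carried = state.get(userId);       // total carried over from previous jobs
        int total = (carried == null) ? 0 : carried;
        for (int r : requests) total += r;         // add counts from the new logs only
        state.set(userId, total);                  // persist for the next notification
        return new SimpleEntry<>(userId, total);
    }
}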
3.2 PageRank
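One way PageRank can benefit from states is by keeping loop-invariant graph information (such as each node's out-degree) in task state, so successive iterations only shuffle rank contributions. The sketch below is hypothetical; the key naming and the fact that state holds integers (so ranks are stored scaled) are assumptions.

import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

interface State { Integer get(String key); void set(String key, int value); } // assumed

public class PageRankReducer {
    static final double DAMPING = 0.85;

    // reduce(nodeId, rankContributions, state): <nodeId, newRank>
    public static Map.Entry<String, Double> reduce(String nodeId,
                                                   Iterable<Double> contributions, State state) {
        double sum = 0.0;
        for (double c : contributions) sum += c;           // incoming rank shares
        double newRank = (1.0 - DAMPING) + DAMPING * sum;  // standard PageRank update
        // The assumed State stores ints, so persist the rank scaled; keeping the
        // previous rank in state allows convergence checks without an extra job.
        state.set("rank:" + nodeId, (int) (newRank * 1_000_000));
        return new SimpleEntry<>(nodeId, newRank);
    }
}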
3.3 Single-Source Shortest Path
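A hedged sketch of the relaxation step for this example, again using the assumed State interface: a node re-emits candidate distances (minDistance + edge.weight) to its neighbours only when its best-known distance improves, and that best distance is carried in state across jobs. All class and field names here are illustrative.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

interface State { Integer get(String key); void set(String key, int value); } // assumed

public class ShortestPathMapper {
    static class Edge {
        final String target;
        final int weight;
        Edge(String target, int weight) { this.target = target; this.weight = weight; }
    }

    // map(nodeId, minDistance, state): list of <neighbour, candidateDistance>
    public static List<Map.Entry<String, Integer>> map(String nodeId, int minDistance,
                                                       List<Edge> edges, State state) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Integer best = state.get(nodeId);              // best distance from previous jobs
        if (best == null || minDistance < best) {      // distance improved: relax edges
            state.set(nodeId, minDistance);
            for (Edge edge : edges) {
                out.add(new SimpleEntry<>(edge.target, minDistance + edge.weight));
            }
        }
        return out;                                    // unchanged nodes emit nothing
    }
}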
[Figure 1: Modified system architecture. The JobTracker initializes each new job, creates its tasks (setting up backup state), adds the job to the jobs queue, and maintains a BackupStates table; the scheduler tries to schedule each task on its previous TaskTracker. TaskTrackers communicate with the JobTracker through heartbeats and keep a local States table that is updated after task execution. The execution JVM, reached over an RPC-based protocol, retrieves the state from HDFS if needed, invokes the stateful API, writes the new state to HDFS, and returns the new state and backup location.]
4. DESIGN AND IMPLEMENTATION
In this section, we present the proposed design and implementation details of extending Hadoop to support the Stateful MapReduce API described in Section 2. For Stateful MapReduce to be acceptable, maintaining states should be achieved with minimal additional overhead. Also, the new API should not affect the scalability or the fault tolerance of Hadoop.
In the basic architecture of Hadoop, a JobTracker process runs on the master node and a TaskTracker process runs on each slave node. When a job is submitted to the system, the JobTracker initializes the job, creates the map and reduce tasks, and then adds the job to the execution queue. TaskTrackers communicate with the JobTracker through heartbeat messages. When the JobTracker receives a heartbeat from a TaskTracker indicating that the TaskTracker can accept new tasks, the task scheduler picks suitable tasks from the jobs queue and assigns them to that TaskTracker. The task scheduler tries to assign map tasks to the machines where their inputs already reside. The TaskTracker creates a new execution JVM for each task.
The proposed extensions are based on three principles: 1) states are maintained locally at TaskTrackers; 2) a persistent copy of each state is written to HDFS; and 3) at the end of each task, the JobTracker is informed of the location of the task's persistent state.
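As an illustration of the second and third principles, a task's state can be serialized to an HDFS file whose location is then reported to the JobTracker. This is a sketch only: the path layout, serialization format, and class names are assumptions, not the actual implementation.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StateBackup {
    // Writes a task's state to HDFS and returns the backup location, which the
    // TaskTracker can then report to the JobTracker at the end of the task.
    public static Path persist(Configuration conf, String taskId,
                               Map<String, Integer> state) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path backup = new Path("/stateful/" + taskId + "/state");  // assumed layout
        try (FSDataOutputStream out = fs.create(backup, true)) {   // overwrite old copy
            for (Map.Entry<String, Integer> e : state.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue());
            }
        }
        return backup;
    }
}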
Figure 1 shows our modifications to the overall system architecture. A BackupStates table is maintained by the JobTracker to store the location of the persistent state of each task. Each TaskTracker also keeps a local States table, updated after task execution, and the scheduler tries to place a stateful task on its previous TaskTracker, where the state can be retrieved from the local table instead of from HDFS.
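The following sketch shows how such a table might back the scheduler's placement preference; the class and method names are hypothetical, not Hadoop's actual scheduler code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BackupStatesTable {
    // taskId -> (TaskTracker that last ran the task, HDFS path of its persisted state)
    static final class Entry {
        final String trackerName;
        final String hdfsPath;
        Entry(String trackerName, String hdfsPath) {
            this.trackerName = trackerName;
            this.hdfsPath = hdfsPath;
        }
    }

    private final Map<String, Entry> table = new ConcurrentHashMap<>();

    // Called when a task finishes and reports its backup location.
    public void record(String taskId, String trackerName, String hdfsPath) {
        table.put(taskId, new Entry(trackerName, hdfsPath));
    }

    // The scheduler prefers the TaskTracker that already holds the task's state
    // locally; any other tracker must first fetch the state from HDFS.
    public boolean prefersTracker(String taskId, String candidateTracker) {
        Entry e = table.get(taskId);
        return e != null && e.trackerName.equals(candidateTracker);
    }
}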
5. EVALUATION
[Figure 2: Running time (in minutes) of the sessionization task after each notification, using stateless versus stateful MapReduce.]
In our first experiment, we compared the running time of two MapReduce jobs: 1) a stateless MapReduce job and 2) a stateful MapReduce job. In the stateless job, the system combines all the logs received after each notification and resubmits all of them as a new MapReduce job. In the stateful job, stateful reducers are used to maintain the total number of objects requested by each user so far, and only the newly arriving logs are submitted as the job input. The running time of processing each notification is recorded, in addition to the latency overhead introduced by maintaining the states in the stateful program.
The evaluation infrastructure consisted of a cluster of 10 slave Amazon EC2 small instances in addition to 1 master small instance. Each instance had 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), and 160 GB of local storage. All the machines ran Fedora Core Linux, Java 1.6.0_07, and Hadoop 0.203.0. We created a new customized Amazon Machine Image (AMI) on which the Stateful MapReduce implementation inside Hadoop 0.203.0 was deployed, and recreated a similar cluster to run the stateful jobs. All the default Hadoop configurations were left unchanged except for the number of reducers: we used 25 reducers for both experiments.
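For reference, setting the reducer count in the Hadoop 0.20-era API looks like the following sketch; the class name is hypothetical, and this only illustrates how such a configuration is typically expressed, not the paper's actual driver code.

import org.apache.hadoop.mapred.JobConf;

public class JobSetup {
    public static JobConf configure() {
        JobConf conf = new JobConf(JobSetup.class);
        conf.setNumReduceTasks(25);   // the 25 reducers used in both experiments
        return conf;
    }
}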
Figure 2 shows the running time of the jobs launched after each notification. Using stateless MapReduce, the running time keeps increasing as more data is received, which indicates the poor performance of stateless MapReduce in such applications, especially when much data needs to be processed. On the other hand, stateful MapReduce achieves an almost constant running time as more data arrives, since it avoids all the redundant communication (resubmitting all the previously received logs after each new notification) and computation (recounting the number of objects requested by each user).
To provide an estimate of the overhead incurred by maintaining states, we measured the latency of writing each task state to persistent storage (HDFS) as well as the size of the state. Figure 3 shows the average latency and state size for each run.
[Figure 3: Average state-write latency and average state size for each run (x-axis: run number).]
6. DISCUSSION
A second set of experiments, in which we investigate the performance of stateful MapReduce using other types of jobs, is currently in progress. In these experiments, we consider the PageRank and Single-Source Shortest Path problems described in Sections 3.2 and 3.3, respectively. We prepared a semi-synthetic large graph dataset based on the LiveJournal graph [9], which consists of 4,847,571 nodes and 68,993,773 edges. Weights for the edges were generated randomly from the range [0, 1]. To enlarge the dataset, a long string was appended to each node id, bringing the graph size to around 12 GB. We plan to compare the running times of the stateless and stateful versions of the two jobs.
There are three other possible directions to investigate towards the development of stateful MapReduce:
1. Building a system that is aware of the states introduces many optimization opportunities. For example, as described in Section 4, the scheduler at the JobTracker uses information about the TaskTracker on which a task previously ran to make better scheduling decisions that avoid retrieving states from persistent storage.

2. In our current implementation, we optimized the scheduling of reduce tasks. However, it is more challenging to consider the states when scheduling map tasks. Map task scheduling is based on avoiding loading map input from a remote machine, so loading the task state should also be considered when deciding on the machine on which each map task should run; a hedged sketch of such a combined cost follows this list.

3. Online MapReduce [5] can be considered complementary to stateful MapReduce, since online MapReduce is concerned with avoiding the latency of materializing the intermediate data, while our work is concerned with avoiding the latency of repeating computations and data transfers.
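To make the second direction concrete, a placement decision could weigh remote input bytes against remote state bytes. The cost model below is purely illustrative; all names and the equal weighting are assumptions.

public class MapPlacementScore {
    // Hypothetical cost of running a map task on a candidate node: bytes that
    // would have to cross the network for the input split and the task state.
    public static long cost(boolean inputIsLocal, boolean stateIsLocal,
                            long inputBytes, long stateBytes) {
        long remoteInput = inputIsLocal ? 0 : inputBytes;
        long remoteState = stateIsLocal ? 0 : stateBytes;
        return remoteInput + remoteState;   // lower is better
    }
}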
7. CONCLUSION
8. REFERENCES