Lec 06 - Hadoop MapReduce 2.0 (Part-I)
Content of this lecture: in this lecture we will discuss the MapReduce paradigm and what is new in MapReduce version 2. We will look into its internal working and implementation, see many examples of how different applications can be designed using MapReduce, and examine how scheduling and fault tolerance are supported inside the MapReduce version 2 implementation.
Programs written in this style are automatically executed in parallel on a large cluster of commodity machines. The parallel programming paradigm is taken care of automatically, and the users or programmers do not have to bother about how parallelization and synchronization are managed; parallel execution is done automatically by the MapReduce framework. As far as the runtime is concerned, it takes care of the details of partitioning the input data, scheduling program execution across the set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience in parallel and distributed computing to easily utilize the resources of a large distributed system. Let us explain this entire notion of the runtime operations that MapReduce provides for executing these programs on large-scale commodity machines.
We start with the input data set, which we assume to be very large and stored on the cluster; that is, it may be stored on hundreds or thousands of nodes. The data set is stored with the help of partitioning: it is partitioned and the pieces are placed on these nodes. For example, if 100 nodes are available, the entire data set is partitioned into 100 different chunks and stored in this way for computation.
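As a rough illustration only (HDFS actually does this by fixed-size blocks, as we saw in the previous lecture; the round-robin placement below is an assumption for the sketch), partitioning a data set into one chunk per node can be pictured like this in Python:

def partition_dataset(records, num_nodes=100):
    # Split the input records into one chunk per node, round-robin,
    # so all nodes hold roughly equal shares of the data set.
    chunks = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        chunks[i % num_nodes].append(record)
    return chunks

records = [f"record-{i}" for i in range(1000)]
chunks = partition_dataset(records)
print(len(chunks), len(chunks[0]))  # 100 chunks, 10 records each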
The next task is to schedule the program's execution across these, say, 100 machines where the data set is stored. The framework then launches the programs, in the form of map and reduce tasks, which execute in parallel at all places; that is, if one hundred nodes are involved, all one hundred chunks experience the execution in parallel. This too is done automatically by MapReduce, using the job tracker and task tracker that we have already seen. In the newer version, 2.0, this scheduling and resource allocation is done by YARN once the request comes from the MapReduce execution for the job on that data set.
There are further issues. These nodes are comprised of commodity hardware, as we have seen in the previous lecture on HDFS, so nodes may fail. Node failures are taken care of automatically: the work is rescheduled onto alternative replicas of the data and the execution carries on without any effect on the result (a toy sketch of this retry idea follows below). The framework also has to manage inter-machine communication: for example, when intermediate results become available, nodes are sometimes required to communicate values and results to each other. This is called 'shuffle' and 'combine', and shuffle and combine require inter-machine communication, which we will see in more detail in the further slides.
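To make the fault-tolerance idea concrete, here is a toy simulation (not Hadoop's actual mechanism; the node names and failure rate are invented for illustration), where a task that fails on one node is simply retried on another node holding a replica of the same chunk:

import random
random.seed(0)  # make the simulated failures reproducible

def run_task(chunk, replica_nodes):
    # Try the task on each node holding a replica of the chunk; if a
    # node fails, reschedule on the next replica so the overall
    # execution carries on unaffected.
    for node in replica_nodes:
        try:
            if random.random() < 0.3:  # simulate a commodity-node failure
                raise RuntimeError(f"{node} failed")
            return f"processed {chunk} on {node}"
        except RuntimeError:
            continue  # reschedule on an alternative replica
    raise RuntimeError("all replicas failed")

print(run_task("chunk-42", ["node-1", "node-7", "node-19"]))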
All of this allows programmers without any experience with parallel and distributed systems to use the framework. It simplifies the entire programming paradigm: the programmer only has to write simple MapReduce programs. We will see how programmers can write MapReduce programs for different applications while the execution is automatically taken care of by this framework. A typical MapReduce computation may process terabytes of data on thousands of machines. As we have already seen, hundreds of MapReduce programs have been implemented, and upwards of 1,000 MapReduce jobs are executed on Google's clusters every day. Companies like Google and other big companies that work on big data computation use this technology; that is, their applications are written in MapReduce, as we are going to see.
Now let us look at the motivation for MapReduce: why are we so enthusiastic about this programming paradigm, and why is it so widely used in big data computation? The motivation for MapReduce is to support large-scale data processing. If you want to run a program on thousands of CPUs, MapReduce is the paradigm that makes this practical, and the companies that need such large-scale data processing rely on this framework for the task. Another important motivation is that this paradigm makes programming very simple and does not require the programmer to manage the many intricate details underneath.

The MapReduce architecture provides automatic parallelization and distribution of the data and its configuration. Parallel computation on the distributed system is fully abstracted away: the programmer is given the simple MapReduce paradigm and does not have to bother about these details, since parallelization and distributed computation are performed automatically. Another important part is failures: fault tolerance is also supported by the MapReduce architecture, and the programmer does not have to worry about it; it is taken care of automatically, provided the programmer supplies sufficient configuration information, so that fault tolerance works across different applications. Input/output scheduling is likewise automatic: many optimizations are used to reduce the number of I/O operations and to improve the performance of the execution engines. Similarly, the monitoring of all the data nodes, that is, of the task trackers' execution and their status updates, is done using client-server interaction in the MapReduce architecture; that too is taken care of.
Now let us go into more detail on the MapReduce paradigm and see what MapReduce is from the programmer's perspective. The terms 'map' and 'reduce' are borrowed from functional programming languages; Lisp is one such language that has these features. Let us look at the map and reduce functionality as supported in such a functional programming language. Say we want to write a program for calculating the sum of squares of a given list. Assume the list we are given is (1 2 3 4), a list of numbers, and we want to find the squares of these numbers. There is a function called 'square', and map says that this square function is to be applied on each and every element of the list. The output will be:
(map square ‘(1 2 3 4))
Output: (1 4 9 16)
[processes each record sequentially and independently]
All the squares are computed and the output is given, again in the form of a list. Note that the square operation may be executed in parallel on all the elements and the result obtained very efficiently; it is not the case that the squares must be computed sequentially one by one. Each record is processed independently, so this is a parallel execution environment for the map function, here performing the square operation on all the elements provided to it. That was the calculation of the squares; to find the sum, another routine is required to perform the summation of this intermediate result (the output of map is called the 'intermediate result'). With this input we perform another operation, called 'reduce'. Reduce applies an addition operation on this list, which is nothing but the intermediate result calculated by the map function. Note that sum is a binary operator, so it performs the summation two elements at a time; once all elements are combined, the final output, the required sum of the squares, is produced.
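For illustration, the same sum-of-squares computation can be written in Python (this snippet is not from the lecture; Python's built-in map and functools.reduce mirror the Lisp functions used above):

from functools import reduce

numbers = [1, 2, 3, 4]

# Map phase: apply square to each element. Each application is
# independent, so they could all run in parallel.
squares = list(map(lambda x: x * x, numbers))  # intermediate result: [1, 4, 9, 16]

# Reduce phase: combine the intermediate results with a binary operator.
total = reduce(lambda a, b: a + b, squares)    # 1 + 4 + 9 + 16 = 30

print(squares, total)  # [1, 4, 9, 16] 30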
Given this kind of function from a functional programming language, let us see how MapReduce can be applied in the Hadoop scenario. Consider a small sample application: word count. Say we have a large collection of documents, for example a Wikipedia dump of Shakespeare's works (it could be the works of any other important person as well). We are going to process this huge data set and are asked to list the count of each word in these documents. A query could then be, for example, to find a particular reference in Shakespeare's works: which character is more important, based on how many times Shakespeare references that character.
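As an illustrative sketch (the function name and record format are assumptions, not Hadoop's actual API), the map side of word count can be written as a function that takes one record, say a line of a document, and emits an intermediate (word, 1) key-value pair for every word it sees:

def map_word_count(document_line):
    # Map function: emit an intermediate (key, value) pair for every
    # word, with the count 1 as the value.
    pairs = []
    for word in document_line.split():
        pairs.append((word.lower(), 1))
    return pairs

# Each input record is processed independently, so many map tasks
# can run in parallel on different chunks of the document collection.
print(map_word_count("Welcome everyone Hello everyone"))
# [('welcome', 1), ('everyone', 1), ('hello', 1), ('everyone', 1)]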
Now we come to the next phase, the 'reduce' operation. After the map, the intermediate values, which are generated in the form of key-value pairs, are used as the input for the reduce function. Reduce processes these intermediate values output by the map function and merges all the intermediate values associated with the same key: based on equality of keys, the values are combined together. This is the 'group by key' operation, and on each resulting group the computation specified by the reduce function is applied to produce the final output. For example, suppose the intermediate pairs above are the input to the reduce stage and we want a word count. First the pairs are grouped by key: 'everyone' appears twice, so the key 'everyone' has two values, which the reduce function adds up; 'hello' appears once, so its group holds only that single tuple, and similarly for 'welcome'. The reduce function executes in this manner.
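Again as a sketch with assumed helper names (not Hadoop's API), the shuffle step can be modeled as grouping the intermediate pairs by key, after which the reduce function sums the values of each group:

from collections import defaultdict

def group_by_key(pairs):
    # Shuffle step: merge all intermediate values associated with
    # the same key into one group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_word_count(key, values):
    # Reduce function: the computation applied per key group.
    return key, sum(values)

intermediate = [('welcome', 1), ('everyone', 1), ('hello', 1), ('everyone', 1)]
for key, values in group_by_key(intermediate).items():
    print(reduce_word_count(key, values))
# ('welcome', 1) ('everyone', 2) ('hello', 1)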
Each key is assigned to one reduce task. Suppose two different nodes run reduce tasks, say reduce task 1 and reduce task 2. Each key is assigned to exactly one reduce task: for example, the key 'everyone' may be assigned to task 1, and when 'everyone' comes a second time it must be assigned to that same reduce task, whereas 'welcome' and 'hello' can be assigned to reduce task 2. The framework processes in parallel and merges all the intermediate values by partitioning the keys, as we have already told you. But how are the keys to be partitioned? A simple example is hash partitioning: if we have a hash function f, then applying f to a particular key gives a value, and this value modulo the number of reduce tasks (here, 2) decides which partition the key goes to. We have shown partitioning through this simple example of hash partitioning; it can also be much more complicated, depending on the application, on how the partitioning is done, and on how many reducer nodes or reduce tasks are allocated by YARN. Based on that, the partitioning is carried out; after partitioning, the intermediate values are merged and the result is given back.
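A minimal sketch of such a hash partitioner follows (a stable hash is used here so that every mapper sends a given key to the same reduce task; Hadoop's real partitioner differs in detail):

import hashlib

NUM_REDUCE_TASKS = 2  # assumed number of reduce tasks

def partition(key, num_tasks=NUM_REDUCE_TASKS):
    # Hash partitioning: f(key) mod (number of reduce tasks).
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_tasks

for word in ["everyone", "hello", "welcome", "everyone"]:
    print(word, "-> reduce task", partition(word))
# Both occurrences of 'everyone' always go to the same reduce task.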
Finally, the programming model: the computation takes a set of input key-value pairs and produces a set of output key-value pairs. To do this, the user specifies, using the MapReduce library, the two functions map and reduce, which we have already explained.
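Conceptually, the two user-supplied functions have the following shapes (sketched here with assumed Python type aliases, not any concrete Hadoop interface):

from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")  # input key/value types
K2 = TypeVar("K2"); V2 = TypeVar("V2")  # intermediate key/value types

# map: takes an input (key, value) pair and produces a list of
# intermediate (key, value) pairs.
MapFn = Callable[[K1, V1], Iterable[tuple[K2, V2]]]

# reduce: takes an intermediate key together with all values grouped
# under that key, and produces the output values.
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]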