CME 323: Distributed Algorithms and Optimization, Lecture 2
http://stanford.edu/~rezab/dao.
Instructor: Reza Zadeh, Matroid and Stanford.
Definition 2.2 (Weakly Scalable) If SpeedUp(p, np) = T1,n / Tp,np = Ω(1), then our algorithm is
weakly scalable.
This metric characterizes the case where, for each processor we add, we add more data as well.
This is a useful metric in practice, because oftentimes the only time we can afford to add more
processors or machines is when we are burdened with more data than our infrastructure can handle.
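As a quick sanity check, consider the parallel sum algorithm from the previous lecture, assuming (as recalled from that lecture rather than derived here) that summing m numbers with p processors takes Θ(m/p + log p) time. Then

T1,n = Θ(n)   and   Tp,np = Θ(np/p + log p) = Θ(n + log p),

so

SpeedUp(p, np) = T1,n / Tp,np = Θ(n / (n + log p)) = Ω(1)   whenever log p = O(n).

That is, parallel sum is weakly scalable as long as the per-processor data size n grows at least as fast as log p.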
Definition 2.3 (Embarrassingly Parallel) When the DAG representing an algorithm has 0-depth,
the algorithm is said to be embarrassingly parallel.
That is, there is no dependency between our operations. It’s scalable in the most trivial sense,
e.g. flipping as many coins as possible at the same time.
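As a toy sketch of the coin-flipping example (using Python's multiprocessing purely as an illustration; any parallel framework would do, and the names here are not from the notes), each worker flips its own batch of coins and no task depends on another:

import random
from multiprocessing import Pool

def flip_batch(n):
    """Flip n fair coins and count heads; needs no input from any other task."""
    return sum(random.random() < 0.5 for _ in range(n))

if __name__ == "__main__":
    # Four independent batches: the computation DAG has depth 0.
    with Pool(processes=4) as pool:
        heads_per_batch = pool.map(flip_batch, [250_000] * 4)
    print(sum(heads_per_batch), "heads out of 1,000,000 flips")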
We note here that we have used Brent’s theorem to derive the scaling bounds for the parallel
sum algorithm. In the previous lecture, we alluded to the fact that Brent’s theorem assumes
optimal scheduling, which is NP-hard. Fortunately, the existence of a polynomial time constant
approximation algorithm for optimal scheduling implies that these bounds still hold.
Figure 1: An Embarrassingly Parallel DAG
2.2 Scheduling
In addition to building algorithms with low depth, clever scheduling is just as important to par-
allelism. Given a DAG of computations, at any level in the DAG there are a certain number of
computations that are ready to execute (at the same time). This number is not necessarily equal
to the number of processors available to you, so you need to decide how to assign computations
to processors; this is what is referred to as scheduling.
Ideally, you wish for all your processors to be busy; however, depending on how jobs are assigned
to processors, you might end up with processors that are idle. Depending on the size and
dependencies of the jobs to be scheduled, it may not be possible for all processors to be busy
all the time. This turns into an optimization problem in which we try to schedule jobs in a way
that minimizes the idle time of processors. This problem turns out to be NP-hard.
It is the scheduler’s task to assign jobs in such a way that minimizes the idle time of processors.
We could do this greedily, i.e., as soon as there is any computation ready to be done, we assign
it to a processor. Or we can look ahead in our DAG to see if we can plan more efficiently.
Spark has a scheduler. Every distributed computing setup has a scheduler. Your operating
system and your phone have schedulers. Every computer has processes, and every computer runs
them in parallel. Your computer might have fifty Chrome tabs open and must decide which one
to give priority to in order to optimize the performance of your machine.
An important problem in any parallel or distributed computing setting is figuring out how to
schedule jobs optimally: a scheduler must be able to assign sequential computations to processors
or machines so as to minimize the total time necessary to process all jobs.
Notation We assume that the processors are identical (i.e. each job takes the same amount of
time to run on any of the machines). More formally, we are given p processors and an unordered set
of n jobs with processing times J1 , . . . , Jn ∈ R. Say that the final schedule for processor i is defined
by a set of indices of jobs assigned to processor i. We call this set Si . The load for processor i is
therefore, Li = ∑_{k∈Si} Jk. The goal is to minimize the makespan, defined as Lmax = max_{i∈{1,...,p}} Li.
The intuition behind the greedy algorithm discussed here is simple: in order to minimize the
makespan we don’t want to give a job to a machine that already has a large load. Therefore, we
consider the following algorithm. Take the jobs one by one and assign each job to the processor
that has the least load at that time. This algorithm is simple and is online.
for each job that comes in (streaming) do
    Assign the job to the least-burdened machine
end
Algorithm 1: Simple scheduler
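For concreteness, here is a minimal Python sketch of Algorithm 1 (the function and variable names are illustrative, not from the notes); it keeps a min-heap of machine loads so each assignment takes O(log p) time:

import heapq

def greedy_schedule(jobs, p):
    """Assign each incoming job to the currently least-loaded of p machines.

    Illustrative sketch of Algorithm 1; returns per-machine loads and the makespan.
    """
    # Min-heap of (current load, machine index); every machine starts empty.
    heap = [(0, i) for i in range(p)]
    heapq.heapify(heap)
    loads = [0] * p

    for processing_time in jobs:              # jobs arrive one at a time (streaming)
        load, machine = heapq.heappop(heap)   # least-burdened machine
        loads[machine] = load + processing_time
        heapq.heappush(heap, (loads[machine], machine))

    return loads, max(loads)

if __name__ == "__main__":
    print(greedy_schedule([3, 5, 3, 1, 6], p=2))  # -> ([12, 6], 12); OPT here is 9, within the factor-2 bound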
Other variants of scheduling We note there are many other variants of scheduling. Jobs can
have dependencies, i.e. one job must finish before another job can start. Here, the problem is pre-
specified by a computational DAG that is known before the time of scheduling. Another variant
is that scheduling must happen online, i.e. jobs come at you in an order where you cannot look
into the future. As jobs come in, you have to schedule them, and you cannot go back and change
the schedule. For a comprehensive treatment of variants of scheduling, see Handbook of Scheduling.¹
In either of the above cases, where jobs have dependencies or must be scheduled online, the problem
is NP-hard, so we use approximation algorithms. We claim that the simple (greedy, online)
algorithm above actually has an approximation ratio of 2. In other words, in the worst case the
algorithm is 2 times worse than the optimal, which is fairly good. For this analysis, we define the
optimal makespan (the minimal makespan possible) to be OPT and compare the output of the
greedy algorithm to it. We also define Lmax as above to be the makespan of the greedy schedule.
Claim: Greedy algorithm has an approximation ratio of 2.
Proof: We first want to get a handle (a lower bound) on OPT. We know that the optimal makespan
must be at least the sum of the processing times of the jobs divided evenly among the p processors,² i.e.

OPT ≥ (1/p) ∑_{i=1}^{n} Ji .    (1)

A second lower bound is that OPT is at least as large as the processing time of the longest job:³

OPT ≥ maxi Ji .    (2)
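As a small numerical illustration (using a made-up job list, not one from the notes): for jobs with processing times 3, 5, 3, 1, 6 and p = 2 processors, bound (1) gives OPT ≥ 18/2 = 9 and bound (2) gives OPT ≥ 6, so OPT ≥ 9; indeed the split {3, 5, 1} and {3, 6} achieves makespan 9.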
¹ To give an idea of another variant, consider the case of distributed computing, where each machine houses a set
of local data, and shuffling data across the network is a bottleneck. We may consider scheduling jobs to machines
such that no data are shuffled; this is called locality-sensitive scheduling. We’ll consider this more in the latter part
of the course.
² To see this, assume toward contradiction that OPT is able to schedule jobs such that OPT < (1/p) ∑_{i=1}^{n} Ji. Suppose
that instead of assigning jobs to p processors in parallel, we assigned all the work to one processor sequentially.
Then the total time required is p · OPT < ∑_{i=1}^{n} Ji, which is a contradiction, since ∑_{i=1}^{n} Ji is exactly
the amount of work required to process all n jobs on a single processor.
³ The reason for this is simple: the longest job must be scheduled at some point to be run sequentially on one
processor, at which point it will require maxi Ji time to compute. There may be other processors which bottleneck
our makespan, but OPT must take at least as long as any single job, and in particular this holds for the
largest job.
Now consider running the greedy algorithm and identify the processor responsible for the makespan
of the greedy schedule (i.e. k = argmaxi Li). Let Jt be the processing time of the last job placed on
this processor. Before this last job was placed, the load of this processor was thus Lmax − Jt. By
the definition of the greedy algorithm, this processor must have had the least load at the moment
the last job was assigned. Therefore, every processor at that time must have had load at least
Lmax − Jt, i.e. Lmax − Jt ≤ L′i for all i, where L′i denotes the load of processor i at that moment.
Hence, summing this inequality over all i,
p(Lmax − Jt) ≤ ∑_{i=1}^{p} L′i ≤ ∑_{i=1}^{p} Li = ∑_{i=1}^{n} Ji .    (3)
For the second inequality, note that although Jt was the last job placed on the bottleneck processor,
there may still have been other jobs yet to be assigned at that moment; hence the total load placed
on each machine cannot decrease between that moment and the end of the algorithm. The last
equality comes from the fact that the sum of the loads must equal the sum of the processing times
of the jobs. Rearranging the terms in this expression lets us write:
Lmax ≤ (1/p) ∑_{i=1}^{n} Ji + Jt .    (4)
Now, note that our greedy algorithm has makespan exactly equal to Lmax, by definition. Combining
equation (4) with equations (1) and (2) and the fact that Jt ≤ maxi Ji, we get that our greedy
approximation algorithm has makespan

APX = Lmax ≤ OPT + OPT = 2 × OPT.    (5)
This shows that the greedy algorithm produces a schedule whose makespan is no more than 2 times
the optimal.
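To see that the factor of 2 is essentially tight for the greedy rule (a standard example, not worked out in these notes), consider p machines, p(p − 1) jobs of length 1 arriving first, followed by a single job of length p. Greedy spreads the unit jobs evenly, giving every machine load p − 1, and then the long job lands on one of them, so Lmax = (p − 1) + p = 2p − 1. The optimal schedule instead reserves one machine for the long job and spreads the unit jobs over the remaining p − 1 machines, so OPT = p. The ratio is (2p − 1)/p = 2 − 1/p, which approaches 2 as p grows.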
What if we could see the future? We note that if we first sort the jobs in descending order
and assign the larger jobs first, a straightforward analysis gives a 3/2 approximation. The intuition
is that if we schedule the large jobs first, we can use the smaller jobs to “fill in the gaps” that
remain, i.e. to balance the loads. The same algorithm with a tighter analysis gives a 4/3
approximation. We’ll see later in the course how Spark uses lazy evaluation for exactly this reason:
by deferring computation until the user takes an action, Spark can reorder jobs to obtain a more
efficient schedule.
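A sketch of the sorted variant (sometimes called longest processing time first); it differs from the greedy scheduler above only in sorting the jobs before assigning them, and again the names are illustrative rather than from the notes:

import heapq

def sorted_greedy_schedule(jobs, p):
    """Sort jobs in descending order, then assign each to the least-loaded machine.

    Illustrative sketch of the sorted (offline) variant discussed above.
    """
    heap = [(0, i) for i in range(p)]
    heapq.heapify(heap)
    loads = [0] * p
    # Sorting requires knowing all jobs up front, so this variant is offline.
    for processing_time in sorted(jobs, reverse=True):
        load, machine = heapq.heappop(heap)
        loads[machine] = load + processing_time
        heapq.heappush(heap, (loads[machine], machine))
    return loads, max(loads)

if __name__ == "__main__":
    print(sorted_greedy_schedule([3, 5, 3, 1, 6], p=2))  # -> ([9, 9], 9); sorting recovers the optimum here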
What’s realistic? It may seem that our above assumption, that we can look into the future and
know which jobs are going to be scheduled, is quite unreasonable. In reality, we don’t even know
how long each job will take. However, we often have a pretty good idea (based on historical data
or expectations) of how long a particular job will take to run. Further, we may have statistics
or expectations on how many jobs of a particular type are going to come in. Hence, it may not be
such an unrealistic scenario to know (within a certain tolerance) the expected amount of time each
job will take, as well as what jobs might be in the pipeline.
We close with a problem to think about: AllPrefixSum. Given an array A, AllPrefixSum(A) is the
array whose k-th entry is the sum of the first k elements of A (starting with 0 for the empty prefix).
As an example, if A = [3, 5, 3, 1, 6], then AllPrefixSum(A) = [0, 3, 8, 11, 12, 18].
This feels like an inherently sequential task. The obvious way to do it on a single machine is to
keep a running sum and write out the intermediate sums as we iterate through the array, in linear
time. However, this does not parallelize at all. How can we parallelize it so that it has low depth?
We’ll take a look at this problem in more detail next lecture. For now, try to come up with your
own algorithm.
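For reference, a sketch of the sequential running-sum baseline described above (the function name is illustrative); it does linear work but has depth n, since every step depends on the previous one:

def all_prefix_sum(values):
    """Sequential AllPrefixSum: keep a running sum and record it after adding each element."""
    sums = [0]            # sum of the empty prefix
    running = 0
    for v in values:
        running += v
        sums.append(running)
    return sums

if __name__ == "__main__":
    print(all_prefix_sum([3, 5, 3, 1, 6]))  # -> [0, 3, 8, 11, 12, 18]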