Improvements in Gang Scheduling For Parallel Supercomputers
[email protected], flcampos,[email protected]
1 Laboratoire ASIM, LIP6, Université Pierre et Marie Curie, Paris, France.
2 Dept. of Information and Computer Science, University of California, Irvine, CA 92697, U.S.A.
…able resources can be allocated to other eligible …
…inally defined by Feitelson [3].

[Figure 1: Cycle, slice, period and slot definitions — trace diagram showing jobs J1–J6 scheduled on processors P0–P4 across four successive periods]
To clarify the application of these policies in Concurrent Gang, let us first state some important concepts: those of cycle, slice, period and slot. Figure 2 illustrates these definitions. A workload change occurs at the arrival of a new job, at the termination of an existing one, or through a variation in the number of eligible threads of a job to be scheduled. The time between workload changes is defined as a cycle. Between workload changes, Concurrent Gang scheduling is periodic, with a period that is a function of the workload and the spatial allocation. A period is composed of slices; a slice corresponds to a time slice as in gang scheduling, with the difference that in Concurrent Gang we may have more than one job simultaneously scheduled in a slice. A slot is a processor's view of a slice: a slice is composed of N slots for a machine with N processors. If a processor has no assigned thread during its slot in a slice, we have an idle slot. The bidimensional diagram shown in figure 3 is inherent to the Concurrent Gang algorithm, and it is used to define the spatial allocation strategy. We refer to this diagram as the trace diagram.
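These definitions translate naturally into a small data structure. The following is a minimal sketch (the names and the slot encoding are our own illustration, not the authors' implementation):

```python
# Sketch of the trace diagram: a slice holds one slot per processor; a
# slot is either a (job, thread) assignment or None (an idle slot).
N = 5  # number of processors (illustrative)

def make_slice(assignments):
    """A slice is a list of N slots; unassigned processors get idle slots."""
    return [assignments.get(p) for p in range(N)]

# A period is a list of slices; the trace diagram is the period viewed as
# a 2-D grid: rows = processors, columns = slices.
period = [
    make_slice({0: ("J1", 0), 1: ("J1", 1), 2: ("J1", 2)}),  # gang slice for J1
    make_slice({0: ("J2", 0), 1: ("J2", 1), 4: ("J3", 0)}),  # J2 and J3 share a slice
]

def idle_slots(period):
    """Count idle slots in the period."""
    return sum(slot is None for sl in period for slot in sl)

print(idle_slots(period))  # -> 4 (two idle slots in each slice)
```

Note how the second slice holds threads of two different jobs, which is exactly what distinguishes a Concurrent Gang slice from a plain gang-scheduling time slice.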
The implementation of Concurrent Gang with first fit and thread migration is a first example of a Concurrent Gang scheduler. It is based on a greedy algorithm applied at the time of a workload change. During a cycle, the workload is obviously assumed constant. Thus, the eligible threads of queued jobs are allocated to processors using the first fit strategy for each slice. After all eligible threads are scheduled on a processor for some slice (slot), the temporal sequence is repeated periodically until a workload change again occurs. In the event of a workload change, the distribution of jobs in the machine is reorganized depending on the change in the workload; since we have a queue of jobs, some thread migration may occur because of this reorganization. We will refer to this strategy henceforth simply as first fit.

Although we defined an algorithm where thread migration is possible, if the machine under consideration has no efficient mechanism for thread migration, algorithms without thread migration are also possible using these concepts. A very simple policy for spatial sharing under Concurrent Gang without thread migration is the greedy one: at arrival, a job is scheduled in a slice that has sufficient idle slots to accommodate it. In this case the definitions of cycle, slice, etc. remain valid. The scheduler should maintain a list of idle slots in the period in order to know, at job arrival, whether it is possible to schedule the job in an already existing slice.
It is worth noting that, relative to its definition as a queueing network with a processor sharing discipline, Concurrent Gang is particularly convenient for describing schedulers that are periodic between workload changes. We will now state a theorem proving that a periodic schedule performs at least as well as any non-periodic one with respect to the total number of idle slots; i.e., periodic schedulers achieve spatial allocation at least as good as that of non-periodic ones when processor utilization is measured through the ratio of the total number of
empty (idle) slots to the total number of slots in the period. We denote this measure as the idling ratio.
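As a concrete illustration, the idling ratio can be computed directly from the trace-diagram view of a period (a sketch with illustrative names):

```python
# Idling ratio: idle slots over total slots in the period.
def idling_ratio(period, n_procs):
    total = len(period) * n_procs                            # slices x processors
    idle = sum(slot is None for sl in period for slot in sl) # None marks an idle slot
    return idle / total

# Two slices on a 4-processor machine, 3 idle slots in total:
period = [["J1", "J1", None, "J2"], ["J2", None, None, "J3"]]
print(idling_ratio(period, 4))  # -> 0.375 (3 idle out of 8 slots)
```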
Theorem 1 Given a workload W, for every temporal schedule S there exists a periodic schedule S' such that the idling ratio of S' is at most that of S.

Proof - First of all, let us make a definition that will be useful in this proof. We define the happiness of a job in an interval of time as the number of slots allocated to the job divided by the total number of slots in the interval.
Define the progress of a job at a particular time as the number of slices granted to each of its threads up to that time. Thus, if a job has V threads, its progress at slice t may be represented by a progress vector of V components, where each component is an integer less than or equal to t. By the rules of legal execution, no thread may lag behind another thread of the same job by more than a constant number C of slices. Therefore, no two elements in the progress vector can differ by more than C. Define the differential progress of a job at a particular time as the number of slices by which each thread leads the slowest thread of the job. Thus the differential progress vector at time t is also a vector of V components, where each component is an integer less than or equal to C; it is obtained by subtracting the minimum component of the progress vector from each component of the progress vector. The system's differential progress vector (SDPV) at time t is the concatenation of all jobs' differential progress vectors at time t. The key observation is that the SDPV can only assume a finite number of values. Therefore there exists an infinite sequence of times t_{i_1}, t_{i_2}, ... such that the SDPVs at these times are identical.

Consider any time interval [t_{i_k}, t'_{i_k}]. One may construct a periodic schedule by cutting out the portion of the trace diagram between t_{i_k} and t'_{i_k} and replicating it infinitely in the time dimension.
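The pigeonhole step above is easy to make concrete. A minimal sketch of the differential progress vector and the SDPV (our notation, not the authors' code):

```python
# Differential progress vector (DPV): subtract the minimum component of a
# job's progress vector from every component. The SDPV concatenates the
# DPVs of all jobs in the system.
def dpv(progress):
    low = min(progress)
    return [p - low for p in progress]

def sdpv(jobs):
    """jobs: one progress vector per job; returns the SDPV as a tuple."""
    out = []
    for progress in jobs:
        out.extend(dpv(progress))
    return tuple(out)

# With the lag bounded by C, every DPV component lies in {0, ..., C}, so
# the SDPV ranges over a finite set -- hence it must eventually repeat.
jobs = [[7, 5, 6], [3, 3]]
print(sdpv(jobs))  # -> (2, 0, 1, 0, 0)
```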
First of all, we claim that such a periodic schedule is legal. From the equality of the SDPVs at t_{i_k} and t'_{i_k} it follows that all threads belonging to the same job receive the same number of slices during each period. In other words, at the end of each period, all the threads belonging to the same job have made equal progress. Therefore, no thread lags behind another thread of the same job by more than a constant number of slices.

Secondly, observe that it is possible to choose a time interval [t_{i_k}, t'_{i_k}] such that the happiness of each job in this interval is at least as much as in the complete trace diagram. This implies that the happiness of each job in the constructed periodic schedule is greater than or equal to the happiness of each job in the original temporal schedule.

Therefore, the idling ratio of the constructed periodic schedule must be less than or equal to the idling ratio of the original temporal schedule. Since the fraction of area in the trace diagram covered by each job increases, the fraction covered by the idle slots must necessarily decrease. This concludes the proof.

4 Simulation and Verification

To verify the results above, we used a general purpose event driven simulator, developed by our research group for studying a variety of related problems (e.g., dynamic scheduling, load balancing). The simulator accepts two different formats for describing jobs. The first is a fully qualified DAG. The second is a set of parameters used to describe the job characteristics, such as the computation/communication ratio.

When the second form is used, the actual communication type, timing and pattern are left unspecified, and it is up to the simulator to convert this user specification into a DAG using probabilistic distributions, provided by the user, for each of the parameters. Other parameters include the spawning factor for each thread, thread life span, synchronization pattern, degree of parallelism (the maximum number of threads that can be executed at any given time), depth of the critical path, etc. Even though probabilistic distributions are used to generate the DAG, the DAG itself behaves in a completely deterministic way.

Once the input is in the form of a DAG, and the module responsible for implementing a particular scheduling heuristic is plugged into the simulator, several experiments can be performed using the same input by changing some of the
parameters of the simulation, such as the number of processing elements available and the topology of the network, among others; the outputs, in a variety of formats, are recorded in a file for later visualization.

For this study we grouped parallel jobs in classes, where each class represents a particular degree of parallelism (the maximum number of threads that can be executed at any given time).
The reason behind grouping parallel jobs by their degree of parallelism is to evaluate the performance of the algorithms being studied across the vast spectrum of real parallel applications (ranging from massively parallel programs to programs requiring only two processing elements) and therefore to reduce the bias towards a single type of application.

We divided the workload into ten different classes, with each class containing 50 different jobs. The arrival time of a job is described by a Poisson random variable with an average rate of two job arrivals per time slice. The actual job selection is done in a round robin fashion by picking one job per class. This way we guarantee the interleaving of heavily parallel jobs with shorter ones.
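The workload generator just described can be sketched as follows (assumed details: the paper specifies the distribution and the round-robin class selection but not the code; the Poisson sampler below uses Knuth's classic method):

```python
import itertools
import math
import random

classes = [2 ** k for k in range(1, 11)]  # degrees of parallelism 2, 4, ..., 1024
pick = itertools.cycle(classes)           # round-robin class selection

def poisson(rng, lam=2.0):
    """Knuth's method: number of arrivals in one time slice (mean lam)."""
    n, p, limit = 0, 1.0, math.exp(-lam)
    while p > limit:
        p *= rng.random()
        n += 1
    return n - 1

def jobs_for_slice(rng):
    """Degrees of parallelism of the jobs arriving in one time slice."""
    return [next(pick) for _ in range(poisson(rng))]

rng = random.Random(42)
trace = [jobs_for_slice(rng) for _ in range(4)]
print(trace)  # arrival counts vary per seed; classes cycle 2, 4, 8, ...
```

Interleaving one job per class per pick is what guarantees that massively parallel jobs and two-processor jobs appear side by side in the arrival stream.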
We distinguish the class of computation instructions from that of communication instructions in the various threads that compose a job. The latter forces the thread to be suspended until the communication is concluded. If the communication is concluded during the currently assigned time slice, the thread resumes execution. We used a factor of 0.001 communication instructions per computation instruction.

The classes are ranked according to their degree of parallelism (between 2 and 1024, in powers of two increments) and the jobs were scheduled on a simulated 1024 processor machine. In table 1 we compare gang scheduling with Concurrent Gang using first fit as the space sharing strategy.

                    Total Running Time (%)   Total Idle Time (%)
  Gang                      123.6                  41.9
  Concurrent Gang           100                    28.2

Table 1: Experimental results

It is important to dissect the value obtained for idle time, which is the result of three factors:
1 - Communications
2 - Absence of ready threads
3 - Inefficiency of allocation
The first is a natural consequence of threads communicating among themselves. The second reflects the fact that jobs arrive and finish at random times, and at any given instant there might not be any job ready to be scheduled. The last is a result of inefficiencies due to the non-optimality of the first fit algorithm.

References

[1] Feitelson, D. G.: Job Scheduling in Multiprogrammed Parallel Systems. IBM Research Report RC 19970, Second Revision, 1997.

[2] Feitelson, D. G., Jette, M. A.: Improved Utilization and Responsiveness with Gang Scheduling. Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pp. 238-261, Springer Verlag, 1997.

[3] Feitelson, D. G.: Packing Schemes for Gang Scheduling. Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pp. 89-110, Springer Verlag, 1996.

[4] Feitelson, D. G., Rudolph, L.: Gang Scheduling Performance Benefits for Fine-Grain Synchronization. Journal of Parallel and Distributed Computing 16, pp. 306-318, 1992.

[5] Jette, M. A.: Performance Characteristics of Gang Scheduling in Multiprogrammed Environments. Supercomputing '97, 1997.

[6] Scherson, I. D., Subramaniam, R., Reis, V. L. M., Campos, L. M.: Scheduling Computationally Intensive Data Parallel Programs. École Placement Dynamique et Répartition de Charge, pp. 39-61, 1996.

[7] Silva, F., Campos, L. M., Scherson, I. D.: A Lower Bound for Dynamic Scheduling of Data Parallel Programs. EUROPAR '98, 1998 (to appear).