
Scheduling Deep Learning Jobs in Multi-Tenant

GPU Clusters via Wise Resource Sharing


Yizhou Luo∗ , Qiang Wang† , Shaohuai Shi† , Jiaxin Lai∗ , Shuhan Qi† , Jiajia Zhang† , Xuan Wang†
Harbin Institute of Technology (Shenzhen)
Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
∗ {23S151149,200110515}@stu.hit.edu.cn, † {qiang.wang,shaohuais,shuhanqi,zhangjiajia,wangxuan}@hit.edu.cn

Corresponding authors: Qiang Wang, Shaohuai Shi

arXiv:2407.13088v1 [cs.DC] 18 Jul 2024

Abstract—Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also encounter challenges such as preemption and migration overhead, along with potential DL accuracy degradation. Nonetheless, few explore the potential benefits of GPU sharing to improve resource utilization and reduce job queuing times.

Motivated by these insights, we present a job scheduling model allowing multiple jobs to share the same set of GPUs without altering job training settings. We introduce SJF-BSBF (shortest job first with best sharing benefit first), a straightforward yet effective heuristic scheduling algorithm. SJF-BSBF intelligently selects job pairs for GPU resource sharing and runtime settings (sub-batch size and scheduling time point) to optimize overall performance while ensuring DL convergence accuracy through gradient accumulation. In experiments with both physical DL workloads and trace-driven simulations, even as a preemption-free policy, SJF-BSBF reduces the average job completion time by 27-33% relative to state-of-the-art preemptive DL schedulers. Moreover, SJF-BSBF can wisely determine the optimal resource sharing settings, such as the sharing time point and sub-batch size for gradient accumulation, outperforming the aggressive GPU sharing approach (the baseline SJF-FFS policy) by up to 17% on large-scale traces.

Index Terms—Distributed Deep Learning, Job Scheduling, Communication Contention

I. INTRODUCTION

The popularity of Deep Neural Networks (DNNs) [1] grows rapidly in both industry and academia owing to their significant role in various applications, such as computer vision and natural language processing. With more and more training data and larger model sizes, training deep models becomes very time-consuming. Distributed Deep Learning (DDL) [2] is widely adopted to speed up the training procedure: it distributes the training workload to a cluster of workers and exploits their parallel computing power to accelerate training.

In the data center scenario where hardware resources are shared by multiple users, multiple online DDL training jobs run simultaneously, and the resulting resource contention can lead to severe performance degradation if the jobs are not scheduled properly [3]. For such an online scheduling system that concurrently handles a rising number of jobs, flexible resource allocation and efficient job scheduling are indispensable to maximize resource utilization. There exist traditional schedulers [4]–[8] for general computing tasks, but they are not specifically designed for DDL training jobs and cannot leverage the characteristics of DDL (such as iterativeness and convergence properties) for maximal training efficiency.

Existing DL job management and scheduling systems [9]–[14] commonly employ preemptive and exclusive strategies to enhance system utilization and minimize job completion time. The advanced heuristic scheduler Tiresias [13] demonstrated that the shortest-remaining-service-first (SRSF) algorithm generally yields optimal results when job durations are known. However, small jobs still experience delays waiting for GPU resource release when the cluster is predominantly occupied by large jobs. The state-of-the-art representative is Pollux [15], which dynamically (re-)assigns resources to improve cluster-wide goodput while respecting fairness, and continually optimizes each DL job to better utilize those resources. However, Pollux chooses GPU resources for users and also tunes the training hyper-parameters, which may result in model accuracy degradation [16]. Overall, under preemptive and exclusive policies, long-term job packing can exacerbate head-of-line (HOL) blocking and prolong job completion time (JCT). Consequently, jobs with small training iterations and low GPU demand may face severe queuing and starvation issues, while large ones can suffer from high migration overhead.

In several recent schedulers, including Gandiva [17], Zico [18], Salus [19] and Lucid [20], there has been a notable shift towards emphasizing resource sharing, particularly regarding GPU and network resources. This shift aims to enhance overall resource utilization while addressing queuing and starvation issues effectively. Gandiva [17] introduced GPU time-slicing and job scheduling based on predicted DDL training job characteristics, albeit with a conservative approach limiting GPU sharing to single-GPU jobs. Yu et al. [21], [22] tackled network resource sharing in multiple all-reduce based DDL job training, reducing communication contention overhead. Lucid [20] utilized an indolent packing strategy to mitigate interference effectively. However, their search was confined to
a limited solution space due to the inability to alter the training Preemptive and Exclusive Schedulers. These schedulers
batch size. possess the capability to interrupt or preempt a running
On the contrary, gradient accumulation has become standard job in order to allocate exclusive resources to another job
feature in many deep learning training frameworks [23]. The with higher priority. This mechanism ensures that allocated
basic idea of gradient accumulation is to accumulate gradients resources remain inaccessible to other jobs while the current
from multiple micro-batches and only then update the model job is utilizing them, thereby fostering predictable resource
parameters. This is particularly helpful in training very large usage patterns and mitigating interference between jobs. Early
neural networks [24], where workers can only fit one small works such as Optimus [12] and Cynthia [25] relied on job
micro-batch at a given time, saving GPU memory footprint time prediction, making simplistic assumptions about train-
requirement. From an optimization perspective, gradient ac- ing convergence curves. Tiresias [13] addressed the severe
cumulation is completely equivalent to training with a larger resource starvation issue by proposing adaptive scheduling
mini-batch size, since in both cases the gradient is averaged algorithms with effective job migration strategies. Other stud-
with respect to all computed examples. ies, such as Harmony [26] and Spear [27], leveraged deep
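To make this concrete, a minimal PyTorch-style sketch of gradient accumulation is shown below; model, optimizer, data_loader and loss_fn are generic placeholders rather than components of the system described in this paper.

def train_with_accumulation(model, optimizer, data_loader, loss_fn, accum_steps):
    # Accumulate gradients over `accum_steps` micro-batches before one update,
    # which is numerically equivalent to a single step on a mini-batch that is
    # `accum_steps` times larger, since gradients are averaged over all examples.
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = loss_fn(model(inputs), targets) / accum_steps  # scale so summed grads average
        loss.backward()                                       # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()                                  # update with the accumulated gradient
            optimizer.zero_grad()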
Motivated by the observations outlined above, we introduce reinforcement learning to provide efficient solutions aimed
a job scheduling model that enables multiple jobs to run at minimizing average job completion time or makespans.
concurrently on one or more GPUs. In contrast to the approach Another line of research in DDL job scheduling algorithms
of scaling the training batch size and tuning training hyper- relies on theoretical formulation and optimization, treating
parameters as in [15], we investigate the potential of GPU DDL job scheduling as constrained optimization problems.
sharing to improve the overall performance. This model is The recent state-of-the-art Pollux [15] dynamically reallocates
coupled with gradient accumulation to address GPU memory resources to enhance cluster-wide throughput while ensuring
limitations and ensure model convergence. The contributions fairness and continually optimizing each DL job to maximize
of this paper can be summarized as follows: resource utilization. However, these methods cannot guarantee
• We introduce a novel DDL job scheduling model enabling no accuracy degradation for all models. Moreover, they may
multiple jobs to fully or partially share the same set of encounter performance degradation due to migration [13] and
GPUs while ensuring model convergence through gradi- GPU under-utilization [28].
ent accumulation. Unlike existing methods that increase Non-preemptive Schedulers. Early non-preemptive sched-
batch size and GPU numbers to enhance performance, ulers predominantly relied on heuristic algorithms based on
risking accuracy degradation, our model focuses on GPU job characterization and hardware performance modeling. In
resource sharing across jobs and mitigates GPU memory recent studies, attention has shifted towards resource sharing,
constraints through gradient accumulation, thereby poten- encompassing GPU and network resources, which holds sig-
tially reducing queuing time for waiting DDL jobs. nificant potential for improving computing resource utilization
• We propose SJF-BSBF (shortest job first with best shar- and alleviating starvation. Gandiva [17] introduced GPU time-
ing benefit first), a straightforward yet effective schedul- slicing and job scheduling by predicting DDL training job
ing algorithm for the aforementioned problem. Initially, characteristics. However, it adopted a conservative approach,
we derive the optimal solution for scheduling a job pair limiting GPU sharing to single-GPU jobs. Zico [18] focused
(one ongoing job and one new arrival) to decide GPU on system-wide memory consumption for concurrent training
sharing feasibility and launch timing. Subsequently, we and devised a feasible memory management solution to ensure
employ a greedy strategy to determine batch size and that concurrent jobs do not exceed the allocated memory
GPU allocation, minimizing interference with existing budget. Wang et al. [29] and Yu et al. [21], [22] addressed net-
jobs and reducing queuing time. work resource sharing in multiple ring-all-reduce based DDL
• Through both physical and simulated experiments, we job training, alleviating communication contention overhead.
evaluate SJF-BSBF on different scales of job traces. Lucid [20] employed an indolent packing strategy to mitigate
Compared to recent DL schedulers such as Tirasias [13] interference. However, few of these approaches offer a general
and Pollux [15], SJF-BSBF reduces average job com- and flexible solution for sharing GPUs among DL jobs.
pletion time by 27-33%. Additionally, compared to the III. P RELIMINARIES
first-fit GPU sharing approach for new arrival jobs, SJF-
A. S-SGD Based Distributed Deep Learning
BSBF avoids those sharing decisions that may degrade
the overall performance, surpassing it by up to 17%. The DNN model is trained in an iterative manner with
the target of minimizing a loss function L (W, D), where W
II. R ELATED W ORK and D are respectively the model weights and the input data.
Scheduling DL training jobs has garnered significant interest For large-scale DNNs, the data-parallel synchronized SGD (S-
recently. Research in this field primarily focuses on fully SGD) is widely applied to train models with multiple workers
utilizing computing resources and allocating them effectively (say N workers, and indexed by g) because it has the same
to achieve optimal efficiency in multi-tenant GPU computing convergence performance as the sequential SGD. Generally
environments. Here we discuss two categories: preemptive and the ith iteration of the training contains four steps: a) Each
exclusive schedulers, as well as non-preemptive schedulers. worker g loads a mini-batch of local data Dig into the device
memory. b) Each worker g performs a feed-forward pass on D_i^g through the neural network and computes the value of the loss function L(W_i, D_i^g). c) The first-order gradients w.r.t. W_i are calculated by backpropagation. d) The gradients ∇L(W_i, D_i^g) from all the workers are aggregated, averaged and then distributed, which is often handled by the All-Reduce collective function. Then all the workers update the model as in Eq. (1):

W_{i+1} = W_i − ξ · (1/N) Σ_{g=1}^{N} ∇L(W_i, D_i^g).    (1)
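For illustration only, a minimal PyTorch-style sketch of one such S-SGD iteration with explicit gradient all-reduce is shown below; it assumes the distributed process group has already been initialized (e.g., with torch.distributed.init_process_group) and is not the training code used in our experiments.

import torch.distributed as dist

def ssgd_iteration(model, optimizer, inputs, targets, loss_fn):
    # One S-SGD iteration following steps a)-d) above: local forward and backward,
    # then all-reduce averaging of gradients before the weight update of Eq. (1).
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)      # a)-b) forward on the local mini-batch
    loss.backward()                             # c) backpropagation
    world_size = dist.get_world_size()
    for param in model.parameters():            # d) aggregate, average and distribute gradients
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
    optimizer.step()                            # W_{i+1} = W_i - xi * averaged gradient
    return loss.item()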
B. All-Reduce Communication

The most common scenario of DDL training is using a large number of computing devices distributed among the nodes of a cluster. As a result, step d) involves extra communication overheads. In Eq. (1), we use ΔW_i = (1/N) Σ_{g=1}^{N} ∇L(W_i, D_i^g) to represent the aggregation of gradients from N workers, which can be done through an all-reduce operation or through a set of parameter servers. For brevity, we assume that the number of nodes is a power of two. Given the number of nodes N and the message size M, the time cost of one All-Reduce operation without contention can be generalized as Eq. (2),

T_allreduce = a + b·M,    (2)

where a and b are two constants that are not related to M [30]. The values of a and b depend on the algorithm used for the All-Reduce operation with different numbers of processes and message sizes [30]. Without loss of generality, we do not limit the communication model to one specific algorithm.
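As an illustration of Eq. (2), the constants a and b can be estimated by a least-squares fit over timed all-reduce calls of different message sizes. The NumPy sketch below uses made-up measurements and is not part of the system described in this paper.

import numpy as np

def fit_allreduce_cost(message_sizes, measured_times):
    # Least-squares fit of T_allreduce = a + b * M (Eq. (2)) from timed all-reduce calls.
    M = np.asarray(message_sizes, dtype=float)
    T = np.asarray(measured_times, dtype=float)
    A = np.stack([np.ones_like(M), M], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, T, rcond=None)
    return a, b

# Hypothetical measurements: message size in bytes, latency in seconds.
a, b = fit_allreduce_cost([1e6, 4e6, 16e6, 64e6], [0.9e-3, 1.5e-3, 4.1e-3, 14.0e-3])
predicted = a + b * 32e6   # estimated cost of a 32 MB all-reduce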
IV. SYSTEM MODELING AND PROBLEM FORMULATION

For ease of reference, we summarize some frequently used notations throughout this paper in Table I.

Table I
FREQUENTLY USED NOTATIONS

Name      Description
S / N     set of servers / GPUs in the cluster
S_i       the i-th server of the cluster
g_{i,j}   the j-th GPU on the i-th server
J         the job set
G_j       number of GPUs requested by job j
J_k       the k-th job
G(J_k)    the set of GPUs used by J_k
S(J_k)    the set of servers used by J_k
a_k       the arrival time of J_k
B_k       the per-GPU mini-batch size used by J_k
I_k       number of iterations that J_k needs to run
t_k       the execution time of one iteration of J_k
L_k       the total execution time of J_k running solely
T_k       the completion time of J_k
E_k       the timestamp when J_k finishes
F         the set of feasible scheduling solutions
f_j^k     the schedule of job j in the k-th solution

We consider a multi-tenant GPU cluster comprising |S| servers equipped with |N| GPUs evenly distributed. These servers are interconnected with a network switch possessing sufficient bandwidth. All GPUs within the cluster share the same specifications and theoretical peak performance. At the onset of a scheduling horizon of |T| time-slots, a set of DDL jobs J awaits scheduling for training over the duration of |T|. Each job J_k ∈ J is characterized by the number of GPUs it requires, denoted as G(J_k), and the total number of training iterations I_k requested by its users.

A. DL Job Training Time Modeling

We first model the training time of one job, which includes the GPU computation time and the network communication time of the all-reduce operation.

1) Modeling GPU Computation: The DL model is trained using back-propagation. The computation time on GPU scales linearly with the per-GPU batch size B, which can be calculated as follows:

t_comp(B) = α_comp + β_comp × B.    (3)

2) Modeling Network Communication: The gradient aggregation overhead depends on the topology as well as the network communication algorithm. We simply define the communication part as follows:

t_comm = α_comm + β_comm × M,    (4)

where M is the message size, and α_comm, β_comm are the all-reduce time model parameters as described in Section III-B.

[Figure 1 appears here: three schedules for two DL jobs on the same GPUs, (a) sequential execution, (b) fully concurrent execution, and (c) partially concurrent execution; concurrent segments incur an interference penalty on top of the execution cost.]

Figure 1. Three job schedules for two DL jobs.

3) Sharing Performance Modeling: Existing schedulers that facilitate GPU sharing, such as Gandiva [17], Gavel [14], and Lucid [20], often adopt conservative and limited approaches or require additional application information to generate schedules. In contrast, we apply a simple interference model to describe the overhead of GPU sharing. We illustrate three possible job schedules for two jobs sharing the same set of GPUs in Figure 1. Schedule (a) sequentially executes the two DL jobs. Schedules (b) and (c) involve invoking the two DL jobs simultaneously or with partial overlap, resulting in varying degrees of interference penalty. To optimize the average job completion time, one must balance the tradeoff between job queuing/waiting time (Job 2 waits for Job 1 to finish in (a)) and interference penalty (complete overlap of two jobs leads to severe penalty in (b)). In practice, the job iteration time under GPU sharing can be measured and modeled by equations (3) and (4), as the jobs occupy partial GPU and network resources with similar trends. To simplify the model, if a new job shares
GPUs occupied by an existing job (Job A and Job B), we adjust their job iteration times as follows:

t̂_A = t_A ξ_A,    (5)
t̂_B = t_B ξ_B,    (6)

where ξ_A and ξ_B denote the interference ratios, reflecting the performance degradation resulting from GPU sharing. The solution to determining the optimal scheduling point under this scenario will be discussed in Section V-A.

4) Modeling Gradient Accumulation: Given that GPU memory constraints may limit the per-GPU batch size, some schedulers tackle this limitation through memory offloading [31] (which may introduce additional system overhead) or by adjusting batch sizes and other training hyper-parameters [15] (which may compromise model accuracy). As our model incorporates GPU sharing, the memory footprint frequently imposes constraints on feasibility. Thus, we focus on gradient accumulation, which can dynamically reduce the sub-batch size while preserving the original model accuracy as per the user's requested batch size. It is also easily implemented using popular DL frameworks. It is important to note that one can utilize gradient accumulation algorithms to manage the computational aspect, thereby reducing the batch size to mitigate memory consumption. We subsequently define the overall iteration time as follows:

t^j_iter = (s − 1) × t^j_comp(B/s) + ((t^j_comp(B/s))^δ + (t_comm)^δ)^{1/δ},    (7)

where s represents the accumulation step required to attain the original batch size, and δ denotes the degree of overlap between GPU computation and all-reduce communication, as initially proposed in [15]. It is important to acknowledge that δ may vary when different batch sizes are applied.

B. Scheduling Modeling

In this paper, we adopt the "gang-scheduling" discipline widely prevalent in practical large-scale GPU clusters [13], [22], [32]. Under gang scheduling, all workers (i.e., GPUs) of a DDL job must be allocated simultaneously. Furthermore, once a job commences its scheduled run, all allocated GPUs must remain dedicated to the job until its completion, with no allowances for preemption or migration. (It is worth noting that frequent job preemption and migration can significantly degrade performance [13].) Upon job completion, the occupied resources are simultaneously released. Differing from conventional GPU-exclusive scheduling policies, we permit GPUs to be occupied by multiple workers from various jobs concurrently. These workers can be allocated within a single server or across multiple servers, provided there exists a network path connecting them.

Assume y_{jg}[τ] denotes that job j uses GPU g in time slot τ. Job j requires G_j GPUs:

Σ_{g∈G} y_{jg}[τ] = G_j.    (8)

To ensure that one GPU holds at most C jobs, we have

Σ_{j∈J[τ]} y_{jg}[τ] ≤ C, ∀j ∈ J[t], τ ∈ T, g ∈ G.    (9)

In practice, we observe that interference degradation can be severe, rarely improving performance when more than two jobs share the same set of GPUs. Therefore, we set C = 2 in our context.

Also, since we consider gang scheduling, we have

y_{jg}[τ] = y_{gs}[τ − 1], ∀s ∈ S, j ∈ J[τ], a_j < τ ≤ T_j,    (10)
y_{jg}[τ] = 0, ∀g ∈ G, j ∉ J[τ], τ ∈ T,    (11)
y_{jg}[τ] ∈ Z^+, ∀g ∈ G, j ∈ J[τ], τ ∈ T.    (12)

The completion time of job j can be calculated as

T_j = a_j + argmin_τ { Σ_{τ∈T} 1/t^j_iter ≥ I_k }, ∀j ∈ J[τ], τ ≥ a_j,    (13)

φ_j[t] = B_k / t^j_iter,    (14)

where φ_j[τ] denotes the system throughput of the job. In practice, it is more common to monitor and collect DL training throughput using popular DL frameworks. The throughput can be readily converted to iteration time given the training batch size. By measuring DL job throughput under both sole execution and concurrent execution with other jobs, we can fit the time model (Equation (7)) for both cases and naturally infer the interference ratio ξ. Figure 2 illustrates the throughputs of all DL models in our experiments across a range of resource allocations and batch sizes. Overall, our model closely represents the observed data. We also notice that different jobs exhibit varying sensitivities to network communication and GPU workloads. For instance, BERT shows a linear increase with batch size within the experimental range for all GPU configurations, indicating that the bottleneck lies in GPU computation and is constrained by GPU memory. Additionally, YoloV3 mostly achieves peak throughput with a batch size of 16 and encounters network bottlenecks when the GPU number exceeds 12. We also measure the system throughput of different job pairs and training configurations, as depicted in Figure 3. We find that the throughput can be fitted by Equation (14), albeit with different parameters from the solely running mode. Moreover, the interference ratios of different cases span a wide range of up to 6 in our experiments, emphasizing that avoiding unfavorable cases is crucial for improving overall performance.

C. Problem Formulation

In this paper, our goal is to determine the scheduling decisions y_{jg}[t] to minimize the average JCT, which is commonly used to evaluate the efficiency of DL job schedulers [13], [15]. This optimization problem can be formulated as follows:

min_{y_{jg}[t], ∀j,g,t}  Σ_{j∈J[t]} T_j.    (15)
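As a concrete illustration of the models above, the sketch below evaluates the iteration time of Eq. (7) from a fitted computation term and infers the interference ratio ξ from measured solo and shared throughputs via Eq. (14). All numeric values are placeholders, not measurements from our cluster.

def iteration_time(batch_size, accum_steps, t_comp, t_comm, delta):
    # Eq. (7): the first (s - 1) micro-batches are compute-only; the last micro-batch's
    # computation overlaps with the all-reduce, with overlap degree delta.
    sub_batch_time = t_comp(batch_size / accum_steps)
    overlapped = (sub_batch_time ** delta + t_comm ** delta) ** (1.0 / delta)
    return (accum_steps - 1) * sub_batch_time + overlapped

def interference_ratio(solo_throughput, shared_throughput):
    # From Eq. (14), throughput = B / t_iter, so for the same batch size the slowdown
    # of the iteration time under sharing equals the ratio of the two throughputs.
    return solo_throughput / shared_throughput

# Hypothetical fitted computation model t_comp(B) = alpha + beta * B (Eq. (3)).
t_comp = lambda B: 0.010 + 0.002 * B
t_iter = iteration_time(batch_size=64, accum_steps=2, t_comp=t_comp, t_comm=0.030, delta=2.0)
xi = interference_ratio(solo_throughput=400.0, shared_throughput=250.0)   # xi = 1.6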
[Figure 2 appears here: six sub-plots, (a) ImageNet, (b) YoloV3, (c) DeepSpeech2, (d) BERT, (e) CIFAR10 and (f) NEUMF, each plotting system throughput (samples/second) against the local batch size for GPU counts of 1, 2, 4, 8, 12 and 16.]

Figure 2. System throughput for all DL models in our experiments, as measured using a 4-server cluster, each server with 4 NVIDIA 2080Ti GPUs. Each sub-figure shows the values for different resource and training batch size settings for each model.

[Figure 3 appears here: the top row plots system throughput (samples/second) against the local batch size for CIFAR10 paired with CIFAR10, BERT, DeepSpeech2, ImageNet, YoloV3 and NEUMF on shared GPUs, for GPU counts of 1, 4, 8, 12 and 16; the bottom row plots the corresponding interference ratio ξ.]

Figure 3. TOP: System throughput of different DL models paired with CIFAR10 to share the same set of GPUs. BOTTOM: The interference ratio ξ for different DL models and resource and training settings.

V. SOLUTION

We note that Problem (15) presents an integer non-convex program with packing and covering constraints, which is NP-hard. Given these challenges, we opt to explore a heuristic approach that provides a provable local optimum guarantee for a job pair that shares GPUs either completely or partially. In this section, we describe our solution to the problem formulated in Section IV. The solution comprises two parts. Firstly, we address the simple case of two jobs: one running on the GPUs while the other awaits scheduling. It is important to note that concurrent execution on a GPU may degrade overall performance if interference is significant. Thus, we must decide whether the jobs should share GPUs and when to launch the waiting one. This gives rise to Theorem 1, which forms the core of our solution by providing a feasible solution when cluster resources are insufficient. Secondly, we introduce our scheduling algorithm SJF-BSBF (shortest job first with best sharing benefit first), built upon Theorem 1 and the shortest job first strategy. By judiciously selecting job pairs that benefit from GPU sharing, even acting in a non-preemptive manner, SJF-BSBF reduces job queuing time while avoiding scenarios where sharing may detrimentally impact overall performance.

A. Scheduling One Job Pair

We assume that all the tasks of a DL job are assigned to a fixed set of GPUs during its execution. Before we design the scheduling algorithm, each newly arriving DL job should be placed on a certain set of intra-node or inter-node processors, which is called job placement.

Assume that there is a new job A sharing the GPUs occupied by the existing job B, and their execution times under concurrent execution are respectively

t̂_A = t_A ξ_A,    (16)
t̂_B = t_B ξ_B,    (17)

and κ is the inserting time. We have the following theorem.

Theorem 1. The shortest JCT of the above job pair is achieved by either executing them sequentially (κ = t_A i_A) or invoking them concurrently immediately (κ = 0).

Proof. Case 1: If t̂_A i_A ≥ t̂_B i_B, then

T_A = t̂_B i_B + t_A × (i_A − t̂_B i_B / t̂_A),    (18)
T_B = t̂_B i_B.    (19)

The average time is

T = (T_A + T_B)/2    (20)
  = t̂_B i_B + (t_A i_A)/2 − (t̂_B i_B)/(2ξ_A).    (21)

Case 2: If t̂_A i_A < t̂_B i_B, then

T_A = κ + t̂_A × (i_A − κ/t_A),    (22)
T_B = κ + t̂_A × (i_A − κ/t_A) + t_B × (i_B − t̂_A (i_A − κ/t_A) / t̂_B).    (23)

The average time is

T = (T_A + T_B)/2    (24)
  = ((2ξ_B + ξ_A − 2ξ_A ξ_B) / (2ξ_B)) κ + (1 − 1/(2ξ_B)) t̂_A i_A + (1/2) t_B i_B.    (25)

If 2ξ_B + ξ_A − 2ξ_A ξ_B > 0, T is a monotonically increasing function of κ; the minimum is achieved at κ = 0, which indicates that one should start the new job immediately. Otherwise, T is a monotonically decreasing function of κ; the minimum is achieved at κ = t_A i_A, which indicates that overlapping the two jobs degrades the overall performance and they should be executed sequentially.

In practice, evaluating the conditions for the best solution is the same as directly comparing the fully overlapped time and the fully non-overlapped time in terms of time cost.

B. Scheduling Multiple DL Jobs

One critical challenge in addressing Problem (15) is the allocation of GPUs when all GPUs in the cluster are occupied by existing jobs and a new job arrives. One needs to determine 1) which job should share resources with the new arrival, and 2) when to initiate the new job. Our goal is to develop an efficient heuristic online scheduling algorithm for this problem.

1) Basic Idea: We propose an online scheduling algorithm called SJF-BSBF (shortest job first with best sharing benefit first). Algorithm 1 describes the steps of SJF-BSBF. The intuition behind Algorithm 1 has three points. 1) As for job priority, the overall framework is based on the shortest job first (SJF) strategy, as handled by Lines 1-2. This size-based heuristic strategy determines the job priority according to their used GPU numbers. We apply SJF since it performs well most of the time for online scheduling problems [13]. 2) As for GPU allocation, since we consider scheduling two jobs to run concurrently on the same GPUs (wholly or partially), once the number of free GPUs is not enough to execute a job, we look for GPUs already occupied by running jobs to schedule the new one. This is the core logic of SJF-BSBF and is handled by Lines 3-19. 3) Within point 2), for each job pair we also decide the batch size of the new job for gradient accumulation, which must both satisfy the GPU memory constraint and achieve the shortest JCT of scheduling the job pair. This corresponds to Line 11.

2) GPU Allocation: Given a job Jk that needs G_Jk GPUs, we should decide a set of GPUs on which to schedule it. Classical heuristic algorithms include First-Fit (FF) [33] and List-Scheduling (LS) [33]. In Algorithm 1, Lines 3-19 present the step of choosing the GPUs. First, if there are enough free GPUs to execute Jk, we select the top-G_Jk GPUs in G_free so as to make them as consolidated on the nodes as possible (Lines 6-7). Second, notice that we allow at most two jobs to run concurrently on the same GPUs. Once the number of free GPUs is smaller than the request of Jk, we attempt to seek those GPUs that are already occupied by one job (Lines 10-17). We scan G_OJ and determine the best concurrent running setting of the running job and Jk, including the batch size of Jk and whether to let them share the GPUs, using Algorithm 2 introduced later. We add those pairs that can benefit from the sharing strategy (Lines 10-13). Then we sort J_share by the JCT of the job pair in ascending order (Line 14). Finally, we pick up the GPUs from those candidate jobs until the total number can fulfill the request of Jk (Lines 15-17). Notice that we do not pick the free GPUs first in this case, in order to save resources, because the completion time of Jk is determined by the shared GPUs.

3) Batch Size Scaling: In Algorithm 2, given a running job J_run and a new job Jk ready to be scheduled, we present how to adaptively adjust the batch size of Jk to achieve the shortest average JCT of these two jobs. Notice that we do not adjust the batch size of the running job, to reduce the complexity of the scheduling system. We search the batch size in the range
[1, BJk ] with a step of power two (Lines 5 and 12). For each
Algorithm 1 Shortest Job First with Best Sharing Benefit First
candidate batch size, we use Theorem 1 to obtain the best
(SJF-BSBF)
configuration of scheduling the job pair, including the flag
Input: The pending job list Jpending to allocate GPUs at
of whether to let them share GPUs (SF ) and the JCT (Line
the current scheduling time point. Lk is the expected
6). Then we record that configuration if better (Lines 7-11).
remaining time of Jk , calculated by tjiter ×Ik . GOJ denotes
Notice that it is possible that the new job Jk may not be
the GPU set occupied by at least one job. Jshare denotes
scheduled immediately and put back to the pending job pool
the candidate job set for concurrent execution.
if the final SF is F alse, indicating that running the job pair
Output: G(Jk ), The GPU sets of each job Jk in Jpending .
concurrently is not optimal.
1: Sort Jpending by Lk in ascending order.
4) Time complexity of SJF-BSBF: The time consump-
2: for Jk in Jpending do
tion of SJF-BSBF is primarily attributed to searching the
3: G(Jk ) ← ∅.
GPU set for a pending job to be shared when there is
4: Gf ree ← GPUs that hold no job.
insufficient resource (Lines 9 to Line 18 in Algorithm 1).
5: GOJ ← GPUs that hold one job.
Initially, a for loop (Line 10) scans all GPUs with one job,
6: if |Gf ree | ≥ GJk then
iterating |G(OJ)| times. Subsequently, each iteration executes
7: G(Jk ) ← the top-GJk GPUs in Gf ree
Algorithm 2 to determine the time point to initiate GPU
8: else
sharing as well as the appropriate batch size, with a time
9: if |Gf ree | + |GOJ | ≥ GJk then
complexity of θ(log2(BJk )). Lastly, after collecting candidate
10: for g in GOJ do
jobs for sharing GPUs, sorting the list in ascending order
11: Get SF, b, t using Algorithm 2.
to select those with the shortest Job Completion Time (JCT)
12: Add Jg to Jshare if SF is True.
requires θ(|J share|log2(|J share|)). Consequently, the time
13: end for
complexity for scheduling a job is θ(|G(OJ)|log2(BJk ) +
14: Sort Jshare by t in ascending order.
|J share|log2(|J share|)). In our system implementation on a
15: for J in Jshare do
16-GPU cluster, the overhead of periodically scheduling those
16: Add GJ to GJk until |GJk | ≥ GJk .
waiting jobs is negligible, averaging below 0.02 seconds for
17: end for
each operation.
18: end if
19: end if VI. P ERFORMANCE E VALUATION
20: end for
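To clarify how the pairwise decision of Theorem 1 and the batch-size search of Algorithm 2 fit together, the following simplified Python sketch is provided. It is illustrative only: fits_memory and model_pair are hypothetical callbacks standing in for the memory check and the fitted time model of Section IV, not interfaces of our implementation.

def pairwise_avg_jct(t_a, t_b, i_a, i_b, xi_a, xi_b, overlap):
    # Average JCT of a running job A and a new job B on the same GPUs.
    # overlap=True launches B immediately (kappa = 0); overlap=False runs them sequentially.
    ta_hat, tb_hat = t_a * xi_a, t_b * xi_b          # shared iteration times, Eqs. (5)-(6)
    if not overlap:                                  # kappa = t_a * i_a
        T_a = t_a * i_a
        T_b = T_a + t_b * i_b
    elif ta_hat * i_a >= tb_hat * i_b:               # Theorem 1, Case 1: B finishes first
        T_b = tb_hat * i_b
        T_a = T_b + t_a * (i_a - T_b / ta_hat)
    else:                                            # Theorem 1, Case 2 with kappa = 0
        T_a = ta_hat * i_a
        T_b = T_a + t_b * (i_b - T_a / tb_hat)
    return (T_a + T_b) / 2.0

def best_sharing_setting(run_iters, new_iters, requested_batch, fits_memory, model_pair):
    # Algorithm 2 flavor: halve the sub-batch size of the new job (gradient accumulation
    # preserves its effective batch size) and keep the setting with the lowest modeled
    # average JCT; refuse to share when sequential execution is never beaten.
    share, best_b, best_jct = False, requested_batch, float("inf")
    b = requested_batch
    while b >= 1:
        if fits_memory(b):
            t_a, t_b, xi_a, xi_b = model_pair(b)     # modeled iteration times and interference ratios
            shared = pairwise_avg_jct(t_a, t_b, run_iters, new_iters, xi_a, xi_b, True)
            sequential = pairwise_avg_jct(t_a, t_b, run_iters, new_iters, xi_a, xi_b, False)
            if shared < sequential and shared < best_jct:
                share, best_b, best_jct = True, b, shared
        b //= 2
    return share, best_b, best_jct

In SJF-BSBF, a check of this kind is applied to each candidate running job before its GPUs are added to the new job's allocation, so sharing is accepted only when the modeled pairwise JCT beats sequential execution.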
A. Experimental Setup
Cluster configurations: We first conduct physical experi-
ments on a cluster of four servers. Each server is equipped with
an Intel Xeon CPU E5-2670 and four Nvidia GeForce 2080 Ti
Algorithm 2 Batch Size Scaling with Best Sharing Benefit
Input: A given job Jk to allocate GPUs. The user's requested batch size B_Jk. t_share represents the JCT of the best sharing configuration of the two jobs.
Output: The sharing configuration denoted by SF, b, t. SF indicates whether to share the GPUs for the new job.
1: b ← B_Jk
2: b_cur ← B_Jk
3: t ← a large number
4: SF ← False
5: while ⌈b_cur⌉ ≥ 1 do
6:   Get SF_cur and t_share according to Theorem 1 (for sub-batch size b_cur).
7:   if t_share ≤ t then
8:     t ← t_share
9:     b ← b_cur
10:    SF ← SF_cur
11:  end if
12:  b_cur ← b_cur / 2
13: end while
14: return SF, b, t.

GPUs. The network is configured with a Fat-Tree topology, with 10 Gbps links connected to a 100-Gbps switch. All experiments are performed in an environment of Ubuntu 20.04, PyTorch 1.18, CUDA 11.2 and OpenMPI 4.0. Based on the data measured on the physical environment, we then conduct simulation experiments that resemble the physical cluster configuration and test larger scales of clusters and job traces. To evaluate the performance of SJF-BSBF in a large-scale cluster (16 servers, each with 4 GPUs) with long-term traces, we also implement a simulator to record job events and resource usage. All experimental results without explicit comments are derived from the simulation.
Baselines: We consider the following baselines.
1) First-In-First-Out (FIFO): a traditional but popular policy adopted by several well-known cluster management systems, such as Yarn and Kubernetes. However, it usually performs poorly due to its runtime-agnostic scheduling paradigm. It picks the top-Gj GPUs with the least execution time first.
2) Shortest Job First (SJF): an ideal policy to minimize the average JCT without preemption by prioritizing short-term
jobs to overcome HOL blocking. It is impractical as it requires
perfect job information which is impossible to attain.
3) Tiresias [13]: a preemptive policy that prioritizes least greedy manner, SJF-BSBF can adaptively select the best job
attained service jobs (i.e., consumed GPU numbers and train- combination as well as the scheduling point to avoid those
ing iterations). Under this policy, it helps short-term jobs job pairs that may bring down the overall performance, which
escape from resource starvation and finish earlier without any outperforms SJF-FFS by 9% in terms of JCT.
prior information. Job Queuing Delay: Figure 4(b) shows the average queuing
4) Shortest Job First with First Fit Sharing (SJF-FFS): a time of different scheduling policies on different DDL job
sharing policy built upon SJF. It is similar to our proposed models. First, the queuing time of SJF-BSBF is generally
SJF-BSBF except that it does not search the best sharing lower than those heuristic policies with the exclusive GPU
configuration as SJF-BSBF but allocates the job to those GPUs mode. For the model BERT, SJF-BSBF reduces the queuing
that only have one job in a first fit manner if the free GPUs time by nearly 44% compared to Tiresias. Second, since SJF-
are not sufficient for the new job. This policy is a comparison FFS allows the jobs to share the GPUs in an aggressive
baseline to validate the effectiveness of wisely sharing the manner, it generally has the lowest queuing time. However,
GPUs in SJF-BSBF. as reported in Figure 4(a) and Table II, it usually leads to a
5) Pollux [15]: the state-of-the-art elastic scheduler that longer JCT since some job pairs have a high interference ratio
adaptively adjust the GPU resources for each job to optimize and subsequently hurt the overall performance.
the overall job performance as well as resource utilization.
As explained in [20], Pollux cannot guarantee no accuracy 100
1e3

Average Queuing Time(s)


SJF-FFS Tiresias FIFO
degradation for all models as it allows the scheduler to tune 1.5 SJF-BSBF SJF

Fraction of Jobs(%)
80
the training batch size, while our SJF-BSBF applies gradient
60 1.0
accumulation to attain the same convergence as the original SJF-FFS
SJF-BSBF
user specific batch size setting. 40
Tiresias 0.5
For physical experiments, we compare our SJF-BSBF with 20 SJF
FIFO 0.0
FIFO, SJF and Tiresias to demonstrate the advantages of 0 BE
RT 10 t F
ch2 Ne NC oV
3
0.0 0.5 1.0 1.5 2.0 AR ee ge Yol
CIF Sp Ima
resource sharing over those exclusive-mode policies. For simu- ep
JCT(s) 1e3 De

lation experiments, we also add Pollux, one of the state-of-the- (a) JCT distributions of different (b) Average queuing time of different
art elasticity-based scheduler, to compare the sharing-based scheduling policies in the physical ex- scheduling policies in the physical ex-
and the elasticity-based policies. periments. periments.
Workload Settings: We generate the workload similar to Figure 4. Performance of different policies in physical experiments.
the Microsoft job trace [3]. More details about the Microsoft
trace can be found in [3] and Appendix of [13]. For the
physical experiments, considering that our testbed only has Table II
4 nodes with 16 GPUs, we generate totally 30 DDL jobs by T HE MAKESPAN AND AVERAGE JCT RUN BY PHYSICAL EXPERIMENTS .
scaling down the original job trace. As job characteristics, such Policy Makespan (seconds) Average JCT (seconds)
as the number of GPUs and the training iterations, we mostly FIFO 2317 662.6
follow the distributions of the real trace: 20 jobs using no more SJF 2319 609.55
than 8 GPUs and 10 jobs using 12 or 16 GPUs. The training Tiresias 2398 664.73
SJF-FFS 2176 530.36
iteration of jobs varies from 100 to 5000. For the simulation SJF-BSBF 2129 483.16
experiments, we mainly follow the settings of Pollux [15].
We randomly sample 240 jobs from the busiest period in the
deep learning cluster traces published by Microsoft [3] and C. Experimental Results on Large-Scale Simulations
also annotate six DL tasks (BERT, CIFAR10, DeepSpeech2, To verify the fidelity of our simulator, we also compare
ImageNet, NCF and YoloV3) used in Pollux to them. The the results of physical experiments with simulations. We ob-
settings of GPU numbers and training iterations also follow serve that the simulator can achieve the realistic experimental
those of Pollux. performance within 5% relative percentage errors on both
makespan and average JCT. This confirms the high fidelity
B. Experimental Results on a Physical Cluster of our simulator.
JCT Improvements: Figure 4(a) demonstrates the JCT dis- JCT Improvements: We first compare the JCTs of different
tributions of the baseline workload using different scheduling scheduling policies on the standard simulation workload. In
policies. Nearly 80% of jobs have no more than 0.75 hour Figure 5(a), it is evident that SJF-BSBF outperforms other
of JCTs using our SJF-BSBF, while other algorithms only policies. Nearly 40% of jobs of SJF-BSBF achieves lower
have less than 70%. SJF-BSBF generally achieves the best than 500 seconds of JCTs, reducing the average JCT of the
performance. In Table II, it is reported that SJF-FFS and SJF- shortest 40% jobs by 37% than Pollux. This demonstrates
BSBF, which allows the GPUs to be shared among jobs, have the preemption-free policy can even obtain better performance
considerable performance improvements over other policies. than the preemptive policy, such as Tiresias and Pollux.
In particular, SJF-BSBF achieves a 27% lower average JCT Tables III and IV present the performance of different
than Tiresias. Besides, instead of allowing GPU sharing in a scheduling policies for 240 jobs and 480 jobs, respectively.
1e4 Table IV
100 1.5

Average Queuing Time(s)


SJF-FFS Pollux SJF P ERFORMANCE OF LARGE - SCALE AND SMALL - SCALE JOBS BY
SJF-BSBF Tiresias FIFO
SIMULATION . (480 JOBS )
Fraction of Jobs(%)

80
1.0
60
Metrics (hrs) Policy All Jobs Large Jobs Small Jobs
SJF-FFS
40 SJF-BSBF FIFO 7.52 10.06 7.08
0.5
Pollux SJF 2.55 7.00 1.90
Tiresias
20 SJF Average JCT Tiresias 5.09 8.25 4.59
FIFO 0.0 Pollux 4.69 6.03 4.51
RT 10 t F
0 2 BE AR ch2 Ne NC oV
3
10 10 3
10 4
10 5
CIF ee ge
Sp Ima Yol SJF-FFS 1.89 8.12 0.99
ep
JCT(s) De SJF-BSBF 1.57 6.13 0.88
FIFO 0.73 4.06 0.21
(a) JCT distributions of different (b) Average queuing time of different SJF 1.87 5.35 1.35
scheduling policies in the simulation scheduling policies in the simulation Average Queuing Time Tiresias 4.41 6.60 4.04
experiments. experiments. Pollux 4.41 5.03 4.31
SJF-FFS 0.73 4.06 0.21
Figure 5. Performance of different policies in simulation experiments. SJF-BSBF 0.69 3.62 0.21

Jobs are characterized based on their requested number of 120 jobs to 480 jobs. Figure 6(a) shows the results. An inter-
GPUs, with those requiring more than 4 GPUs considered esting phenomenon is that Pollux can have better performance
large, and others small. For the workload of 240 jobs, SJF- than other policies when the job workload intensity is low.
BSBF demonstrates slightly better performance than the ad- Pollux is more suitable for lighter workload intensity because
vanced policy Pollux. While large jobs under SJF-BSBF its adaptive job batch size and resource scaling techniques are
may experience longer JCTs than Pollux due to GPU shar- limited when clusters are overloaded, which meets the findings
ing overhead, small jobs benefit significantly by potentially in [16], [20]. However, when the workload increases, the GPU
sharing GPUs with large jobs, resulting in markedly shorter resources are rather insufficient so that Pollux cannot benefit
queuing times compared to other preemption-free policies. from this strategy. Across all job workloads, our SJF-BSBF
This advantage is further accentuated as the number of jobs maintains relatively low improvements over other baseline
increases. In Table IV with 480 jobs, SJF-BSBF enhances the policies since it allows the jobs to share the GPUs to shrink
average JCT by nearly 3 times compared to Pollux, primarily the job queuing time.
attributable to the reduction in queuing time for small jobs.
Moreover, SJF-BSBF outperforms SJF-FFS by reducing the SJF-FFS SJF-BSBF
SJF-BSBF
average JCT and queuing time by 17% and 5.5%, respectively. 5 SJF-FFS
Tiresias 1.52
Avg JCT (hours)

Avg JCT (hours)


4 SJF 1.5 1.231.18
Pollux 1.07
0.99
Table III 3 0.86 0.860.91
1.0 0.73 0.73
P ERFORMANCE OF LARGE - SCALE AND SMALL - SCALE JOBS BY 2
SIMULATION . (240 JOBS ) 0.5
1

Metrics (hrs) Policy All Jobs Large Jobs Small Jobs 0.0
0.5x 1.0x 1.5x 2.0x 1.00 1.25 1.50 2.00 2.50
FIFO 2.34 3.92 2.13 Relative Job Load Interference Ratio
SJF 1.25 3.50 0.94
Average JCT Tiresias 1.51 3.48 1.23 (a) Varying the workloads. (b) Varying the interference ratios.
Pollux 1.04 1.96 0.91
SJF-FFS 1.23 4.29 0.81 Figure 6. Effects of various workloads and interference ratios.
SJF-BSBF 1.01 3.38 0.68
FIFO 1.65 2.17 1.58
SJF 0.56 1.76 0.39 Impact of Different Interference Ratios: To evaluate the
Average Queuing Time Tiresias 0.81 1.73 0.68 impact of different interference ratios on our GPU sharing
Pollux 1.00 1.21 0.97
SJF-FFS 0.18 0.98 0.07
policies, SJF-FFS and SJF-BSBF, we artificially inject various
SJF-BSBF 0.14 0.89 0.03 values for all the jobs sharing the same GPUs in the baseline
simulation workload. Figure 6(b) shows the results. When the
Job Queuing Delay: Figure 5(b) presents a comparison of ratio is small (ξ ≤ 1.25), which is the ideal scenario that
the average queuing time among different scheduling policies sharing GPUs brings negligible overhead, SJF-BSBF tends to
for various DDL job tasks in simulation. Notably, the GPU allow all the available sharing decisions as SJF-FFS, which
sharing policies, namely SJF-BSBF and SJF-FFS, consistently results in the same performance. However, when the GPU
yield lower queuing times compared to heuristic policies sharing leads to severe slowdowns for the running jobs, our
operating in exclusive GPU mode. Additionally, preemptive SJF-BSBF can get rid of those job pairs that may hurt the
policies such as Tiresias and Pollux often exhibit longer overall performance in SJF-FFS, which reduces the average
queuing times attributable to job migration. JCT by 8%∼13% when ξ ranges from 1.5 to 2.0.
Sensitivity to job load: We compare the performance of our
SJF-BSBF to other existing policies for increasing workload VII. C ONCLUSION
intensity in terms of job submission frequencies. We scale the In this paper, we delve into resource scheduling for DL jobs
baseline workload of 240 jobs by 0.5× ∼ 2×, ranging from in a multi-tenant GPU cluster, where we harness GPU sharing
capabilities to diminish job queuing time and enhance overall [11] A. Hsu, K. Hu, J. Hung, A. Suresh, and Z. Zhang, “Tony: An orchestrator
performance. We begin by formulating a DL scheduling model for distributed machine learning jobs,” in 2019 USENIX Conference on
Operational Machine Learning (OpML 19), 2019, pp. 39–41.
that accounts for GPU sharing among various jobs and em- [12] Y. Peng, Y. Bao, Y. Chen, C. Wu, and C. Guo, “Optimus: An Efficient
ploys gradient accumulation to surmount memory limitations Dynamic Resource Scheduler for Deep Learning Clusters,” in Proceed-
while maintaining the job training accuracy. We then derive ings of the Thirteenth EuroSys Conference, ser. EuroSys ’18, 2018.
the optimal solution to schedule a job pair on the same set [13] J. Gu, M. Chowdhury, K. G. Shin, Y. Zhu, M. Jeon, J. Qian, H. Liu,
and C. Guo, “Tiresias: A GPU Cluster Manager for Distributed Deep
of GPUs and further design an efficient heuristic scheduling Learning,” in 16th USENIX Symposium on Networked Systems Design
algorithm upon it to unleash the potential of GPU sharing and Implementation (NSDI 19), 2019, pp. 485–500.
in reducing the job queuing time and avoid serious interfer- [14] D. Narayanan, K. Santhanam, F. Kazhamiaka, A. Phanishayee, and
M. Zaharia, “Heterogeneity-Aware cluster scheduling policies for deep
ence with the running jobs. Extensive experiments, including learning workloads,” in 14th USENIX Symposium on Operating Systems
physical implementations and simulations, were conducted Design and Implementation (OSDI 20), Nov. 2020, pp. 481–498.
to demonstrate the effectiveness of SJF-BSBF. Our findings [15] A. Qiao, S. K. Choe, S. J. Subramanya, W. Neiswanger, Q. Ho,
H. Zhang, G. R. Ganger, and E. P. Xing, “Pollux: Co-adaptive cluster
reveal that the non-preemptive SJF-BSBF surpasses advanced scheduling for goodput-optimized deep learning,” in 15th USENIX
preemptive policies like Tiresias and Pollux by leveraging Symposium on Operating Systems Design and Implementation (OSDI
GPU sharing techniques. Furthermore, identifying appropriate 21). USENIX Association, Jul. 2021, pp. 1–18.
sharing settings is pivotal in mitigating severe degradation [16] S. Agarwal, A. Phanishayee, and S. Venkataraman, “Blox: A modular
toolkit for deep learning schedulers,” in Proceedings of the Nineteenth
cases induced by high interference. European Conference on Computer Systems, ser. EuroSys ’24, 2024.
[17] W. Xiao, R. Bhardwaj, R. Ramjee, M. Sivathanu, N. Kwatra, Z. Han,
ACKNOWLEDGMENTS P. Patel, X. Peng, H. Zhao, Q. Zhang, F. Yang, and L. Zhou, “Gandiva:
Introspective Cluster Scheduling for Deep Learning,” in 13th USENIX
This research was supported by the National Symposium on Operating Systems Design and Implementation (OSDI
Natural Science Foundation of China (No. 62302126, 18), 2018, pp. 595–610.
No. 62302123), the Shenzhen Science and Tech- [18] G. Lim, J. Ahn, W. Xiao, Y. Kwon, and M. Jeon, “Zico: Efficient GPU
memory sharing for concurrent DNN training,” in 2021 USENIX Annual
nology Program (No. RCBS20221008093125065, Technical Conference (USENIX ATC 21), Jul. 2021, pp. 161–175.
No. JSGGKQTD20221101115655027, No. [19] P. Yu and C. Mosharaf, “Salus: Fine-Grained GPU Sharing Primitives
JCYJ20220818102414030, No. KJZD20230923115113026, for Deep Learning Applications,” in Proceedings of the third MLSys
No. KJZD20230923114213027), and Guangdong Provincial Conference, 2019.
[20] Q. Hu, M. Zhang, P. Sun, Y. Wen, and T. Zhang, “Lucid: A non-intrusive,
Key Laboratory of Novel Security Intelligence Technologies scalable and interpretable scheduler for deep learning training jobs,” in
(2022B1212010005). Proceedings of the 28th ACM International Conference on Architectural
Support for Programming Languages and Operating Systems, Volume 2,
R EFERENCES ser. ASPLOS 2023, 2023, p. 457–472.
[21] M. Yu, Y. Tian, B. Ji, C. Wu, H. Rajan, and J. Liu, “Gadget: Online
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” nature, vol. 521, resource optimization for scheduling ring-all-reduce learning jobs,” in
no. 7553, p. 436, 2015. IEEE INFOCOM 2022 - IEEE Conference on Computer Communica-
[2] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. aurelio tions, 2022, p. 1569–1578.
Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, “Large [22] M. Yu, B. Ji, H. Rajan, and J. Liu, “On scheduling ring-all-reduce learn-
Scale Distributed Deep Networks,” in Advances in Neural Information ing jobs in multi-tenant gpu clusters with communication contention,”
Processing Systems 25, 2012, pp. 1223–1231. in Proceedings of the Twenty-Third International Symposium on Theory,
[3] M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and Algorithmic Foundations, and Protocol Design for Mobile Networks and
F. Yang, “Analysis of large-scale multi-tenant gpu clusters for dnn Mobile Computing, ser. MobiHoc ’22, 2022, p. 21–30.
training workloads,” arXiv preprint arXiv:1901.05758, 2019. [23] Z. Huang, B. Jiang, T. Guo, and Y. Liu, “Measuring the impact of
[4] V. Jalaparti, P. Bodik, I. Menache, S. Rao, K. Makarychev, and M. Cae- gradient accumulation on cloud-based distributed training,” in 2023
sar, “Network-aware scheduling for data-parallel jobs: Plan when you IEEE/ACM 23rd International Symposium on Cluster, Cloud and In-
can,” in Proceedings of the 2015 ACM Conference on Special Interest ternet Computing (CCGrid), 2023, pp. 344–354.
Group on Data Communication, ser. SIGCOMM, 2015, pp. 407–420. [24] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, H. Lee,
[5] F. Giroire, N. Huin, A. Tomassilli, and S. Pérennes, “When Network J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, GPipe: Efficient Training of
Matters: Data Center Scheduling with Network Tasks,” in IEEE INFO- Giant Neural Networks Using Pipeline Parallelism, 2019.
COM 2019 - IEEE Conference on Computer Communications, April
[25] H. Zheng, F. Xu, L. Chen, Z. Zhou, and F. Liu, “Cynthia: Cost-Efficient
2019, pp. 2278–2286.
Cloud Resource Provisioning for Predictable Distributed Deep Neural
[6] A. Tumanov, T. Zhu, J. W. Park, M. A. Kozuch, M. Harchol-Balter, and
Network Training,” in 2019 48th International Conference on Parallel
G. R. Ganger, “Tetrisched: Global rescheduling with adaptive plan-ahead
Processing (ICPP), Aug 2019.
in dynamic heterogeneous clusters,” in Proceedings of the Eleventh
European Conference on Computer Systems, ser. EuroSys, 2016. [26] Y. Bao, Y. Peng, and C. Wu, “Deep Learning-based Job Placement in
[7] Y. Liu, H. Xu, and W. C. Lau, “Online job scheduling with resource Distributed Machine Learning Clusters,” in IEEE INFOCOM 2019 -
packing on a cluster of heterogeneous servers,” in IEEE Conference on IEEE Conference on Computer Communications, April 2019.
Computer Communications (INFOCOM), 2019, pp. 1441–1449. [27] Z. Hu, J. Tu, and B. Li, “Spear: Optimized Dependency-Aware Task
[8] C. Chen, X. Ke, T. Zeyl et al., “Minimum Makespan Workflow Scheduling with Deep Reinforcement Learning,” in IEEE 39th Interna-
Scheduling for Malleable Jobs with Precedence Constraints and Lifetime tional Conference on Distributed Computing Systems (ICDCS), 2019.
Resource Demands,” in 2019 IEEE 39th International Conference on [28] Q. Weng, W. Xiao, Y. Yu, W. Wang, C. Wang, J. He, Y. Li,
Distributed Computing Systems (ICDCS), July 2019. L. Zhang, W. Lin, and Y. Ding, “MLaaS in the wild: Workload analysis
[9] X. Mei, X. Chu, H. Liu, Y. Leung, and Z. Li, “Energy efficient real-time and scheduling in Large-Scale heterogeneous GPU clusters,” in 19th
task scheduling on cpu-gpu hybrid clusters,” in IEEE INFOCOM 2017 USENIX Symposium on Networked Systems Design and Implementation
- IEEE Conference on Computer Communications, May 2017, pp. 1–9. (NSDI 22), Renton, WA, Apr. 2022, pp. 945–960.
[10] H. Yabuuchi, D. Taniwaki, and S. Omura, “Low-latency job scheduling [29] Q. Wang, S. Shi, C. Wang, and X. Chu, “Communication contention
with preemption for the development of deep learning,” in USENIX aware scheduling of multiple deep learning training jobs,” arXiv preprint
Conference on Operational Machine Learning (OpML), 2019. arXiv:2002.10105, 2020.
[30] T. Hoefler, W. Gropp, R. Thakur, and J. L. Träff, “Toward performance
models of mpi implementations for understanding application scaling
issues,” in Recent Advances in the Message Passing Interface, 2010.
[31] C.-C. Huang, G. Jin, and J. Li, “Swapadvisor: Pushing deep learning
beyond the gpu memory limit via smart swapping,” in Proceedings of
the Twenty-Fifth International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), 2020.
[32] K. Mahajan, A. Balasubramanian, A. Singhvi, S. Venkataraman,
A. Akella, A. Phanishayee, and S. Chawla, “Themis: Fair and efficient
GPU cluster scheduling,” in 17th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 20). Santa Clara, CA:
USENIX Association, Feb. 2020, pp. 289–304.
[33] G. L. Stavrinides and H. D. Karatza, “Scheduling multiple task graphs
in heterogeneous distributed real-time systems by exploiting schedule
holes with bin packing techniques,” Simulation Modelling Practice and
Theory, vol. 19, no. 1, pp. 540 – 552, 2011.
