Job Scheduling in High Performance Computing
Yuping Fan
Illinois Institute of Technology
Abstract—The ever-growing processing power of supercomputers in recent decades enables us to explore increasingly complex scientific problems. Effectively scheduling these jobs is crucial for individual job performance and system efficiency. The traditional job schedulers in high performance computing (HPC) are simple and concentrate on improving CPU utilization. The emergence of new hardware resources and novel hardware structures imposes severe challenges on traditional schedulers. The increasingly diverse workloads, including compute-intensive and data-intensive applications, require more efficient schedulers. Even worse, the above two factors interplay with each other, which makes the scheduling problem even more challenging. In recent years, much research has discussed new scheduling methods to combat the problems brought by rapid system changes. In this research study, we investigate the challenges faced by HPC scheduling and the state-of-the-art scheduling methods developed to overcome these challenges. Furthermore, we propose an intelligent scheduling framework to alleviate the problems encountered in modern job scheduling.

Index Terms—cluster scheduler; High Performance Computing (HPC)

I. INTRODUCTION

Supercomputers are experiencing substantial changes. New hardware resources, such as burst buffers, are emerging. As new resources are incorporated into the system, schedulers need to consider them in the decision-making process. At the same time, hardware structures are changing rapidly. For example, shared network structures, such as Dragonfly and fat tree, are widely adopted in new-generation supercomputers. The changes in hardware structures complicate the scheduling problem, especially when the hardware becomes sharable. Sharing resources is one of the most effective methods to improve resource utilization, but it also brings many issues, such as resource fairness and resource contention. Without scheduling shared resources properly, system performance suffers from resource contention and fairness is hurt as well. In order to fully utilize resources, schedulers are required to keep up with the changes in hardware.

The substantial improvement in hardware drives users to submit more complex problems. For example, supercomputers equipped with GPUs attract deep learning projects, which require a massive number of threads to tolerate latencies. Supercomputers with burst buffers enable data-intensive applications, such as scientific simulations, to execute. Schedulers are responsible for monitoring and coordinating resources in a system to support rapid service to users. Without central control from schedulers, jobs could easily fail and the failure could propagate to the whole system. Hardware updates also cause users to change their behavior to adapt to these changes. For example, the runtime variability brought by network sharing makes users more conservative in their job runtime predictions [1]. As supercomputers become more powerful, users tend to submit jobs more frequently. Schedulers need to detect the changes in user behavior and adjust scheduling policies in order to maintain good performance.

In light of these challenges, studies on modern schedulers develop various methods to improve system efficiency and individual job performance. The goals of these methods can be roughly divided into three groups: fairness, resource utilization, and job performance. However, these goals are often conflicting. In this chapter, I review modern scheduling methods focusing on the applicability of the schedulers. In addition, an intelligent HPC job scheduling framework is proposed to address these issues. The remainder of this chapter is organized as follows. I first review the existing HPC scheduling algorithms. I then introduce the proposed intelligent HPC job scheduling framework. Finally, I conclude this chapter.

II. HPC SCHEDULING ALGORITHMS

A. Scheduling Algorithms on Production Systems

Job scheduling in HPC is responsible for ordering jobs in the waiting queue and allocating jobs to resources according to site policies and resource availability. In HPC, well-known job schedulers include Slurm, Moab/TORQUE, PBS, and Cobalt. Similar to HPC job schedulers, cluster schedulers, such as Apollo, Mesos, Omega, and YARN, basically play the same role in a system. The main difference between HPC schedulers and cluster schedulers is that the infrastructures they serve are different. HPC facilities are designed to serve big scientific and engineering applications that are impossible to run on other systems, while commercial clusters are inclined to serve big data applications.

The scheduling algorithms used on both HPC systems and clusters are simple. The most widely adopted scheduling policy on production HPC systems is FCFS (first-come, first-served), which sorts waiting jobs in the order of their arrival times. For some leadership computing facilities, the main goal is enabling large jobs to run; hence, jobs consuming more system resources have higher priorities to execute. For example, ALCF (Argonne Leadership Computing Facility) adopts a utility-based scheduling policy, named WFP, which favors large and old jobs in the queue [2]. Backfilling is a common strategy used in production HPC scheduling to enhance system utilization [3], [4]. The most widely used backfilling strategies are EASY backfilling and conservative backfilling. EASY backfilling allows waiting jobs to skip ahead under the condition that they do not delay the job at the head of the queue. Conservative backfilling has the stricter condition that jobs can be backfilled only if they do not delay any preceding job in the queue.
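To make the contrast between plain FCFS ordering and EASY backfilling concrete, the sketch below simulates both ideas on a single pool of nodes. It is an illustrative toy, not the implementation of any production scheduler; the job fields and helper names are assumptions for this example, and the user walltime estimate is treated as the job's actual runtime.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Job:
    name: str
    nodes: int       # requested node count
    walltime: float  # user runtime estimate (hours)
    arrival: float   # submission time (hours)

def easy_schedule(jobs: List[Job], total_nodes: int) -> Dict[str, float]:
    """FCFS ordering with EASY backfilling.

    Jobs are considered in arrival order. When the head job does not fit,
    a reservation (shadow time) is made for it, and later jobs may start
    now only if they finish before that reservation, so the head job is
    never delayed.
    """
    running: List[Tuple[int, float]] = []  # (node_count, release_time)
    start_times: Dict[str, float] = {}
    queue = sorted(jobs, key=lambda j: j.arrival)
    now = 0.0

    def free_nodes(t: float) -> int:
        return total_nodes - sum(n for n, end in running if end > t)

    while queue:
        head = queue[0]
        now = max(now, head.arrival)
        if head.nodes <= free_nodes(now):
            start_times[head.name] = now
            running.append((head.nodes, now + head.walltime))
            queue.pop(0)
            continue
        # Earliest time the head job fits: its reserved (shadow) start.
        shadow = next(t for t in sorted(end for _, end in running)
                      if head.nodes <= free_nodes(t))
        # Backfill later jobs that fit now and finish before the shadow time.
        for j in list(queue[1:]):
            if j.arrival <= now and j.nodes <= free_nodes(now) \
                    and now + j.walltime <= shadow:
                start_times[j.name] = now
                running.append((j.nodes, now + j.walltime))
                queue.remove(j)
        now = shadow  # advance to the reserved start of the head job
    return start_times

# A short, narrow job submitted last starts right away in the hole left in
# front of the wide head job, which is the utilization gain backfilling targets.
print(easy_schedule([Job("wide", 90, 4.0, 0.0),
                     Job("huge", 80, 2.0, 0.1),
                     Job("tiny", 10, 1.0, 0.1)], total_nodes=100))
```

Under plain FCFS, the ten free nodes would sit idle until the 80-node job ahead of the small one starts; with EASY backfilling, the one-hour job runs immediately because it finishes well before the blocked head job's reservation.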
Because clusters serve a different purpose than HPC systems, cluster schedulers concentrate on providing satisfactory service to all users by fairly sharing the resources in a system.

B. Hierarchical Scheduling Framework vs. Distributed Scheduling

The ever-increasing HPC system scale poses a serious challenge to modern HPC scheduling. The current central scheduling model cannot keep up with the increasing complexity of resource constraints. The emergence of hierarchical scheduling frameworks expands the traditional schedulers' view beyond the single dimension of nodes. Flux is a good representative example of hierarchical schedulers [5]. In Flux, each job is an instance of the scheduling framework, which can launch many small, high-throughput sub-jobs. Therefore, it combats the scalability issue that exists in many modern HPC schedulers. Resource management in Flux operates at a large granularity, and it can move resources between child jobs. Because of the recursive nature of the hierarchical scheduling framework, it can be extended to schedule emerging resources with small changes. Besides the hierarchical scheduling framework, the distributed scheduler is another scheduling framework to overcome the problem of scalability [6]. In a distributed scheduler, each job waiting in the queue is assigned to a distributed scheduler, which has its own resources. Resource exchanges are done by communication between distributed schedulers. The advantage of distributed schedulers is their scalability, because they require even less computation and memory than hierarchical schedulers, but the downside is the inefficient use of system resources due to their isolation.

C. Multi-Resource Scheduling

Multi-resource scheduling is a research topic in HPC scheduling that has attracted more attention in recent years, because increasingly more resources are incorporated into next-generation HPC systems [7]–[11]. A large body of multi-resource scheduling work focuses on power and compute resource scheduling [12]–[17]. This requires a trade-off between power and performance in decision making. For example, Wallace et al. addressed the power-aware scheduling problem in HPC by optimizing compute node utilization under a power constraint [18]. This solution prefers compute node utilization over the power constraint. In multi-resource cluster scheduling, fairness is more important than other factors. Dominant Resource Fairness (DRF) is a strategy to achieve fair allocation of various resources to users [19]. Although DRF maximizes resource fairness in a system and therefore obtains user satisfaction, much recent research has found that fair sharing and high utilization are conflicting goals, and aggressively enforcing fair sharing has a negative effect on resource utilization. In order to address this challenge brought by fair sharing, some studies make trade-offs between fairness and utilization. For example, Grandl et al. leveraged a multi-dimensional bin packing algorithm to improve resource utilization and then used a knob to balance resource utilization and fairness [20]. The advantage of using the bin packing algorithm is its speed, but it is also a greedy algorithm which allocates jobs in a one-by-one manner based on isolated job information. Compared to cluster scheduling, a scheduler in an HPC system has more time to make scheduling decisions but is required to make the best use of various resources. Therefore, multi-resource HPC scheduling demands more sophisticated scheduling methods. Optimization methods, especially multi-objective optimization methods, are leveraged to achieve better system performance in HPC scheduling [10], [11], [21].

D. Energy-, Power-, Cooling-Aware Scheduling

Energy consumption in HPC and data centers has drawn great attention in recent years. As HPC systems and data centers become increasingly more powerful, one side effect is that power generation cannot keep up with their consumption rate. For example, in 2014, data centers in the U.S. consumed 1.8% of the country's total electricity (70 billion kWh of energy). Because energy expense is becoming an increasingly dominant portion of the operation cost in HPC and data centers, data centers and HPC systems attempt to reduce energy consumption without hurting performance through more effective job scheduling strategies and more energy-efficient hardware. Dynamic Voltage and Frequency Scaling (DVFS) is a technique widely studied in the literature [22]; it adjusts power and speed settings on a computing device's processors, controller chips, and peripheral devices to optimize resource allotment for tasks and maximize power savings when those resources are not needed. In addition, shutting down unused hardware is another effective way of lowering power consumption. However, this method comes at the cost of wasted system resources: the time needed to turn hardware back on can delay latency-critical jobs and tasks. Although this approach is widely studied, it is rarely used in real systems. In recent decades, using renewable and sustainable energy sources to replace traditional energy sources, such as fossil fuels, has been a hot topic. Renewable energy is energy collected from renewable resources, which are naturally replenished on a human timescale, such as sunlight, wind, rain, tides, waves, and geothermal heat. Although renewable energy has been greatly promoted, it contributed only 19.3% of global energy consumption in 2017. Therefore, in recent years, HPC and data center facilities have tried to cut their traditional energy consumption [23]–[28], and a line of research focuses on maximizing the share of consumption drawn from renewable energy so as to reduce consumption from traditional energy sources. In addition, energy prices fluctuate: the price is often low at night, when overall energy consumption is lowest, and high during the daytime. In order to save on energy costs, scheduling methods are developed to use more energy when the energy price is low and to decrease energy consumption when the price is high [13].
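As a concrete illustration of that last idea, the sketch below shifts deferrable jobs into cheap-electricity hours under a simple hourly price table. It is only a toy model of price-aware scheduling, not the method of [13]; the price values, the deadline_hour field, and the greedy hour-picking rule are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed hourly electricity prices ($/kWh): cheap at night, expensive by day.
HOURLY_PRICE: Dict[int, float] = {h: (0.06 if h < 6 or h >= 22 else 0.14) for h in range(24)}

@dataclass
class DeferrableJob:
    name: str
    hours: int          # runtime in whole hours (from the user estimate)
    power_kw: float     # average power draw of the allocated nodes
    deadline_hour: int  # latest hour of the day by which the job must finish

def electricity_cost(job: DeferrableJob, start: int) -> float:
    """Energy cost if the job runs from `start` for `job.hours` hours."""
    return sum(HOURLY_PRICE[h] * job.power_kw for h in range(start, start + job.hours))

def cheapest_start(job: DeferrableJob) -> int:
    """Pick the start hour that minimizes electricity cost while meeting the deadline."""
    candidates = range(0, job.deadline_hour - job.hours + 1)
    return min(candidates, key=lambda s: electricity_cost(job, s))

jobs: List[DeferrableJob] = [
    DeferrableJob("climate_sim", hours=4, power_kw=120.0, deadline_hour=20),
    DeferrableJob("post_process", hours=2, power_kw=30.0, deadline_hour=12),
]
for j in jobs:
    s = cheapest_start(j)
    print(f"{j.name}: start at hour {s}, cost ${electricity_cost(j, s):.2f}")
```

Both jobs land in the low-price night hours here; a real scheduler would additionally respect node capacity, power caps, and interactions with the waiting queue.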
E. HPC Failures and Scheduling

As the size of HPC systems increases drastically in recent years, failures in HPC systems increase accordingly. One of the most important factors behind the increase in application failures is the increase in application sizes: any hardware failure can cause an application failure [29]–[31]. A large application using many resources in a system, such as compute and network resources, is more likely to encounter a hardware failure than a small application [32]. Aging is another factor that causes failures of HPC applications. For example, Zimmer et al. analyzed the relationship between GPU aging and the reliability of HPC jobs on Titan [33]. The analysis shows that large applications (using more than 20% of machine resources) encounter a higher level of application failures. Therefore, in their work, they replaced 50% of aged GPUs and employed techniques to run large jobs on low-failure GPUs through targeted resource allocation. By deploying the techniques on a real HPC system, Titan, they demonstrated the positive impact of an age-aware resource allocation policy.

To tolerate failures in the HPC environment, applications can implement the checkpoint technique, which pauses the application, copies all the required data from memory to reliable storage, and then continues the execution of the application. In case of failure, the application can restart from the latest checkpoint in stable storage and execute from there. Therefore, the checkpoint technique avoids the trouble of starting from scratch. Checkpointing uses many system resources, such as memory, network, and the storage system, so the best checkpoint frequency has raised much research interest in recent years [34], [35]. Frequent checkpointing introduces large overheads to an application; however, if the application fails, it can be restarted from a very recent point. There are two checkpointing levels: global checkpointing and local checkpointing. Local checkpointing is lightweight, copying only memory within a node or a group, while global checkpointing copies the whole application's memory and requires global consistency. However, local checkpointing can only recover from local hardware failures; if cascading failures happen, the application has to roll back to the latest global checkpoint. Because checkpointing consumes resource time, it needs to be considered in scheduling. For example, when users estimate the runtimes of their jobs, if they use checkpoints, they need to take the time consumed by checkpointing into consideration. In addition, the checkpointing operation consumes system resources, such as memory for local checkpointing as well as network, I/O, and storage resources. When users reserve resources at job submission, they need to account for the additional resources consumed by checkpointing operations.

F. Backfilling and User Runtime Estimates

Backfilling is a common strategy used by production HPC facilities. The widely known backfilling strategies are EASY backfilling and conservative backfilling. EASY backfilling is the easiest to implement and produces good scheduling results on production systems, and therefore it is the most widely used strategy. Besides these two backfilling strategies, other strategies have been proposed to improve the performance of backfilling. One of the most effective methods is to improve user runtime estimates. The user runtime estimate is the upper bound of a job's runtime: if a job needs more than this upper bound, the job will be killed by the scheduler, a phenomenon called underestimation. If a job finishes before it reaches this upper bound, the user overestimates the job. The accuracy of a user runtime estimate (defined as Job Actual Runtime / Job Runtime Estimate) is a very important factor in scheduling, because all scheduling decisions are made based on user-provided runtime estimates. However, user-provided runtime estimates have proved to be very inaccurate; based on previous studies, the average accuracy of job runtime estimates on many production systems is less than 60% [36]. Scheduling decisions made based on these inaccurate runtime estimates cause many scheduling performance problems, such as low resource utilization, backfilling, and job priority issues. Therefore, it is crucial to provide more accurate runtime estimates. There are several methods in the literature to improve user runtime estimates. With the wide use of machine learning algorithms in various scientific fields, there have been several attempts to leverage machine learning algorithms to predict job runtimes [36], [37]. The basic idea of these machine learning approaches is to extract features and make job runtime predictions from user inputs, such as job runtime estimates and job size, and historical job information, such as the job runtimes from the same user and project. For example, Gaussier et al. leveraged an online linear regression model to predict job runtime [37]. Fan et al. extended the Tobit model to balance runtime prediction accuracy and underestimation rate [36]. Improving runtime estimates is one way to enhance scheduling performance; optimizing backfilling is another effective way to improve system resource utilization. The traditional backfilling strategies (EASY backfilling and conservative backfilling) pick jobs to backfill from the front of the queue: once they find a job that fits a hole in the schedule, they backfill this job immediately. However, this selection approach may miss the best-matching jobs. In addition, to avoid jobs being killed by the system due to underestimation, methods have been developed to correct predictions adaptively, and this approach allows users to provide more accurate runtime estimates.

G. Scheduling Moldable and Malleable Jobs

Based on who decides the number of nodes and when it is decided, HPC jobs can be classified into four categories: rigid, evolving, moldable, and malleable. For rigid and evolving jobs, users decide how many nodes are used. For rigid jobs, the decisions are made at submission, while for evolving jobs, users can change their node requests during execution. For moldable and malleable jobs, users specify a range of nodes a job can run on and the scheduler decides how many nodes are used. For moldable jobs, the scheduler makes the decision at submission. Malleable jobs are those which can dynamically shrink or expand the resources on which they are executing at runtime.
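To ground the moldable-job case, here is a minimal sketch of how a scheduler might pick the node count for a moldable job at submission time: it scans the user-supplied range and chooses the allocation that fits the currently free nodes while giving the shortest estimated runtime under a simple strong-scaling model. The scaling model, the efficiency parameter, and the field names are assumptions for illustration, not a published policy.

```python
from dataclasses import dataclass

@dataclass
class MoldableJob:
    name: str
    min_nodes: int
    max_nodes: int
    base_walltime: float  # estimated runtime (hours) on min_nodes
    efficiency: float     # assumed scaling exponent in (0, 1]

def estimated_runtime(job: MoldableJob, nodes: int) -> float:
    """Strong-scaling estimate: runtime shrinks sub-linearly as nodes grow."""
    speedup = (nodes / job.min_nodes) ** job.efficiency
    return job.base_walltime / speedup

def choose_nodes(job: MoldableJob, free_nodes: int) -> int:
    """Pick the node count in [min_nodes, max_nodes] that fits the free nodes
    and minimizes the estimated runtime (the largest feasible allocation here)."""
    feasible = [n for n in range(job.min_nodes, job.max_nodes + 1) if n <= free_nodes]
    if not feasible:
        raise RuntimeError(f"{job.name}: not enough free nodes to start now")
    return min(feasible, key=lambda n: estimated_runtime(job, n))

job = MoldableJob("cfd_solver", min_nodes=64, max_nodes=512, base_walltime=10.0, efficiency=0.8)
nodes = choose_nodes(job, free_nodes=300)
print(nodes, round(estimated_runtime(job, nodes), 2))  # 300 nodes, roughly 2.9 hours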
Executing moldable and malleable jobs can potentially improve system utilization and reduce average response time [38]. However, executing those jobs is challenging for HPC systems and schedulers, and at present most HPC facilities do not support moldable and malleable jobs. First, the nature of HPC applications makes it difficult to change job size during execution: most HPC applications have intensive communication between nodes, which makes dynamically changing job size very difficult [8], [9]. Second, it requires HPC schedulers to be adaptive and to monitor system status. Therefore, enabling malleable jobs on HPC systems demands that HPC schedulers do more work in a very short time. Despite the aforementioned challenges, there are studies attempting to schedule malleable jobs in order to improve system performance [39], [40].

H. Workflow Scheduling

Data-intensive data analysis applications often utilize a workflow that contains tens or even hundreds of tasks. Jobs are made of stages, such as map or reduce, linked by data dependencies. When a task has all the required input data ready, it is allocated resources to execute. The input data is stored in file systems and is divided into multiple chunks, each of which is typically replicated three times across different machines. Therefore, executing these data-dependent tasks needs to follow the strict order specified by users. In addition, choosing the location for each task is critical for efficiency in executing jobs. Schedulers prefer to execute tasks on the machines that have a copy of the input data, because local access to input data saves the time of transferring input over the network. It is also beneficial to the whole system, because it reduces the amount of data moved across the global network [41]. Therefore, some studies concentrate on improving data locality in scheduling [42]–[45]. For example, Quincy attempts to balance latency and data locality for all runnable tasks [46]. The data locality of MapReduce jobs can be improved by scheduling both map and reduce tasks of one job on the same rack. Corral achieves better data locality by coupling the placement of data and compute nodes. Data analysis applications are often delay-sensitive, which means it is crucial to meet the deadlines of these workflows. The workflow scheduling problem is known to be NP-complete in general, so scheduling algorithms often utilize heuristics and optimization techniques to try to obtain a near-optimal scheduling decision. Hadoop is a popular MapReduce implementation that deals with independent map-reduce tasks. To meet deadlines, a large body of studies concentrates on giving higher priorities to time-sensitive tasks and delaying other tasks or jobs in the system so as to reduce deadline violations of time-sensitive tasks. Delay scheduling meets deadlines via another approach: it allows time-sensitive tasks to wait for a certain amount of time, which increases the chance of finding a better allocation for time-sensitive tasks on machines that store their data locally [47]. Apache Oozie is a workflow scheduler for Hadoop jobs, which presents workflow tasks as Directed Acyclic Graphs (DAGs). Oozie combines multiple jobs sequentially into one logical unit of work, and gives the provision to execute tasks which are scheduled to run periodically.

III. OVERVIEW OF PROPOSED JOB SCHEDULING FRAMEWORK

The challenges faced in today's job scheduling in HPC and data centers demand more intelligent schedulers to make smarter scheduling decisions. Therefore, I extend the modern scheduling framework for job scheduling in HPC, which comprises a resource manager, a job manager, a scheduling decision maker, a system performance monitor, and a job performance monitor. The functions of these modules are explained as follows:

1) Resource manager: Unlike the traditional resource manager, which only manages nodes, the next-generation resource manager is responsible for monitoring the status of various schedulable resources in the HPC system, allocating resources to jobs, retrieving resources when a job finishes or fails, and reporting abnormal resource behaviors.

2) Job manager: Upon job submission, a job manager records the job's basic information (such as user name and project name) and its resource requirements (such as node requirement, the maximum time to run the job, and memory requirement). The job manager informs the scheduling decision maker about the basic information of the incoming job, and the scheduling decision maker orders jobs in the waiting queue based on user input and the current status of the waiting queue. In addition, the job manager is responsible for monitoring job status changes and updating the job status based on the information provided by the resource manager, the scheduling decision maker, and the operating system.

3) Scheduling decision maker: To meet the challenges in modern job scheduling in HPC, the next-generation decision maker needs to make smart scheduling decisions based on the flexible requirements of the system administrators. Different HPC systems and data centers have their own unique goals. For example, some systems aim to provide fast response times to time-sensitive applications, and the decision maker is supposed to adopt scheduling algorithms that give high priorities to the time-sensitive applications and reserve a portion of system resources to meet bursts of time-sensitive applications. If an HPC system concentrates on running big jobs, the scheduler needs to focus on assigning big jobs to allocations with minimal interference from other jobs. A scheduler used in a production system is supposed to be flexible enough to plug in other scheduling policies.

4) System performance monitor: If a system focuses on optimizing its resource utilization, the system performance monitor is needed to monitor the current system status and to record and analyze the system's past performance. The purpose of monitoring and analysis is to alert system administrators and correct scheduling decisions if the system performance does not reach the expectation [48], [49].
5) Job performance monitor: The job performance monitor records and reports abnormal job behaviors, such as abnormal exits from the system, long job wait times, and long job running times. The analysis can also be conducted on a per-user or per-project basis, so the system administrators can find what kind of user behavior or application source code causes the degradation of job performance.

In summary, an intelligent job scheduler is the trend of the future. The main difference between the intelligent job scheduler and the traditional job scheduler is that the intelligent job scheduler is capable of monitoring job and system status and therefore provides feedback to system administrators and to the scheduler itself to make adjustments accordingly.

IV. CONCLUSION

Job scheduling in HPC systems and data centers is an active research field which plays a crucial role in the effective utilization of HPC and data center resources and the efficient execution of jobs. In this chapter, I reviewed the challenges faced by HPC job scheduling and the approaches adopted by schedulers to alleviate these problems. From the literature review, I found that the current HPC job scheduling framework is not smart enough to address these various challenges. Therefore, I propose an intelligent HPC job scheduling framework to monitor abnormal behaviors and performance in an HPC system and improve system and job performance dynamically.

REFERENCES

[1] B. Li, S. Chunduri, K. Harms, Y. Fan, and Z. Lan. The Effect of System Utilization on Application Performance Variability. In ROSS, 2019.
[2] W. Allcock, P. Rich, Y. Fan, and Z. Lan. Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne. In JSSPP, 2017.
[3] D. Tsafrir, Y. Etsion, and D. Feitelson. Backfilling Using System-Generated Predictions Rather Than User Runtime Estimates. In TPDS, 2007.
[4] D. Tsafrir and D. G. Feitelson. The Dynamics of Backfilling: Solving the Mystery of Why Increased Inaccuracy May Help. In IEEE International Symposium on Workload Characterization, 2006.
[5] D. Ahn, J. Garlick, M. Grondona, D. Lipari, B. Springmeyer, and M. Schulz. Flux: A Next-Generation Resource Management Framework for Large HPC Centers. In 43rd International Conference on Parallel Processing Workshops, 2014.
[6] F. R. Dogar, T. Karagiannis, H. Ballani, and A. Rowstron. Decentralized task-aware scheduling for data center networks. In SIGCOMM Comput. Commun., 2014.
[7] C. Hung, L. Golubchik, and M. Yu. Scheduling jobs across geo-distributed datacenters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SoCC), 2015.
[8] P. Qiao, X. Wang, X. Yang, Y. Fan, and Z. Lan. Preliminary Interference Study About Job Placement and Routing Algorithms in the Fat-Tree Topology for HPC Applications. In CLUSTER, 2017.
[9] P. Qiao, X. Wang, X. Yang, Y. Fan, and Z. Lan. Joint Effects of Application Communication Pattern, Job Placement and Network Routing on Fat-Tree Systems. In ICPP Workshops, 2018.
[10] Y. Fan, Z. Lan, P. Rich, W. Allcock, M. Papka, B. Austin, and D. Paul. Scheduling Beyond CPUs for HPC. In HPDC, 2019.
[11] Y. Fan, P. Rich, W. Allcock, M. Papka, and Z. Lan. ROME: A Multi-Resource Job Scheduling Framework for Exascale HPC Systems. In IPDPS Poster, 2018.
[12] Z. Zhou, Z. Lan, W. Tang, and N. L. Desai. Reducing energy costs for IBM Blue Gene/P via power-aware job scheduling. In Proceedings of the 17th Workshop on Job Scheduling Strategies for Parallel Processing, 2013.
[13] X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka. Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems. In SC, 2013.
[14] L. Yu, Z. Zhou, Y. Fan, M. Papka, and Z. Lan. System-wide Trade-off Modeling of Performance, Power, and Resilience on Petascale Systems. In The Journal of Supercomputing, 2018.
[15] T. Patki, D. Lowenthal, A. Sasidharan, M. Maiterth, B. Rountree, M. Schulz, and B. de Supinski. Practical Resource Management in Power-Constrained, High Performance Computing. In HPDC, 2015.
[16] O. Mammela, M. Majanen, R. Basmadjian, H. Meer, A. Giesler, and W. Homberg. Energy-aware job scheduler for high-performance computing. In Comput. Sci., 2012.
[17] M. Guzek, D. Kliazovich, and P. Bouvry. HEROS: Energy-Efficient Load Balancing for Heterogeneous Data Centers. In IEEE 8th International Conference on Cloud Computing, 2015.
[18] S. Wallace, X. Yang, V. Vishwanath, W. Allcock, S. Coghlan, M. Papka, and Z. Lan. A Data Driven Scheduling Approach for Power Management on HPC Systems. In SC, 2016.
[19] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In NSDI, 2011.
[20] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource Packing for Cluster Schedulers. In SIGCOMM, 2014.
[21] Y. Fan and Z. Lan. Exploiting Multi-Resource Scheduling for HPC. In SC Poster, 2019.
[22] J. Lee, B. Nam, and H. Yoo. Dynamic Voltage and Frequency Scaling (DVFS) scheme for multi-domains power management. In IEEE Asian Solid-State Circuits Conference, 2007.
[23] A. Pahlevan, M. Rossi, P. Valle, D. Brunelli, and D. Atienza. Joint Computing and Electric Systems Optimization for Green Datacenters. In Handbook of Hardware/Software Codesign, 2017.
[24] F. Kong and X. Liu. A Survey on Green-Energy-Aware Power Management for Datacenters. In ACM Computing Surveys, 2014.
[25] V. Devabhaktuni, M. Alam, S. S. S. R. Depuru, R. C. Green, D. Nims, and C. Near. Solar energy: Trends and enabling technologies. In Renewable and Sustainable Energy Reviews, 2013.
[26] S. K. Garg, C. S. Yeo, A. Anandasivam, and R. Buyya. Energy-Efficient Scheduling of HPC Applications in Cloud Computing Environments. In CoRR, 2009.
[27] Í. Goiri, K. Le, M. E. Haque, R. Beauchea, T. D. Nguyen, J. Guitart, J. Torres, and R. Bianchini. GreenSlot: Scheduling energy consumption in green datacenters. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.
[28] D. Kliazovich, P. Bouvry, Y. Audzevich, and S. U. Khan. GreenCloud: A Packet-Level Simulator of Energy-Aware Cloud Computing Data Centers. In IEEE Global Telecommunications Conference (GLOBECOM), 2010.
[29] W. Tang, Z. Lan, N. Desai, and D. Buettner. Fault-aware, utility-based job scheduling on Blue Gene/P systems. In IEEE International Conference on Cluster Computing and Workshops, 2009.
[30] K. Vinay and S. M. D. Kumar. Fault-Tolerant Scheduling for Scientific Workflows in Cloud Environments. In IEEE 7th International Advance Computing Conference (IACC), 2017.
[31] D. Kumar, Z. Shae, and H. Jamjoom. Scheduling Batch and Heterogeneous Jobs with Runtime Elasticity in a Parallel Processing Environment. In IPDPS PhD Forum, 2012.
[32] K. C. Webb, A. C. Snoeren, and K. Yocum. Topology switching for data center networks. In Proceedings of the 11th USENIX conference on Hot topics in management of internet, cloud, and enterprise networks and services (Hot-ICE), 2011.
[33] C. Zimmer, D. Maxwell, S. McNally, S. Atchley, and S. S. Vazhkudai. GPU age-aware scheduling to improve the reliability of leadership jobs on Titan. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2018.
[34] L. Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, and S. Matsuoka. FTI: High performance Fault Tolerance Interface for hybrid systems. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.
[35] Y. Fan. Application Checkpoint and Power Study on Large Scale Systems. In IIT Tech. Report, 2021.
[36] Y. Fan, P. Rich, W. Allcock, M. Papka, and Z. Lan. Trade-Off Between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates. In CLUSTER, 2017.
[37] E. Gaussier, D. Glesser, V. Reis, and D. Trystram. Improving Backfilling by using Machine Learning to predict Running Times. In SC, 2015.
[38] Y. Fan, P. Rich, W. Allcock, M. Papka, and Z. Lan. Hybrid Workload Scheduling on HPC Systems. In Advances in Computer and Network Simulation and Modelling, 2021.
[39] S. Gupta, T. Patel, C. Engelmann, and D. Tiwari. Failures in Large Scale Systems: Long-Term Measurement, Analysis, and Implications. In SC, 2017.
[40] R. Sadykov. On scheduling malleable jobs to minimise the total weighted completion time. In 13th IFAC Symposium on Information Control Problems in Manufacturing, 2009.
[41] V. Jalaparti, P. Bodik, I. Menache, S. Rao, K. Makarychev, and M. Caesar. Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can. In SIGCOMM Comput. Commun., 2015.
[42] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar. ShuffleWatcher: Shuffle-aware Scheduling in Multi-tenant MapReduce Clusters. In USENIX ATC, 2014.
[43] Y. Caniou, E. Caron, A. K. W. Chang, and Y. Robert. Budget-Aware Scheduling Algorithms for Scientific Workflows with Stochastic Task Weights on Heterogeneous IaaS Cloud Platforms. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018.
[44] M. Masdari, S. ValiKardan, Z. Shahi, and S. Azar. Towards Workflow Scheduling in Cloud Computing: A Comprehensive Analysis. In Journal of Network and Computer Applications, 2016.
[45] T. Estrada, J. Benson, and V. Pallipuram. A comprehensive study of elasticity for DAG workflows in Cloud environments. In University of Delaware Technical Report, 2015.
[46] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In SOSP, 2009.
[47] M. Zaharia, D. Borthakur, J. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In ACM EuroSys, 2010.
[48] Y. Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, and M. Papka. Deep Reinforcement Agent for Scheduling in HPC. In IPDPS, 2021.
[49] Y. Fan and Z. Lan. DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling. In Software Impacts, 2021.