Slurm. Our Way.
Douglas Jacobsen, James Botts, Helen He
NERSC
CUG 2016
NERSC Vital Statistics
● 860 active projects
○ DOE selects projects and PIs, allocates most of our computer time
● 7750 active users
● 700+ codes, both established and in-development
● edison XC30, 5586 ivybridge nodes
○ Primarily used for large capability jobs
○ Small - midrange as well
○ Moved edison from Oakland, CA to Berkeley, CA in Dec 2015
● cori phase 1 XC40, 1628 haswell nodes
○ DataWarp
○ realtime jobs for experimental facilities
○ massive quantities of serial jobs
○ regular workload too
○ shifter
repurposed
...
Problem: srun directly execs the application from the hosting filesystem location. The FS cannot deliver the application at scale. aprun would copy the executable to an in-memory filesystem by default.

Workaround: Increase the PMI timeout from 60s to something bigger (app env): PMI_MMAP_SYNC_WAIT_TIME=300

Solution: New 15.08 srun feature merging sbcast and srun
srun --bcast=/tmp/a.out ./mpi/a.out
slurm 16.05 adds a --compress option to deliver the executable in similar time as aprun

Other scaling topics:
● srun ports for stdout/err
● rsip port exhaustion
● slurm.conf TreeWidth
● Backfill tuning
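Put together in a batch script, the workaround and the new flags look roughly like this (the node count and paths are illustrative, not NERSC's actual settings):

    #!/bin/bash
    #SBATCH -N 1024                      # illustrative job size

    # Workaround: give PMI more time to synchronize at startup (default 60s)
    export PMI_MMAP_SYNC_WAIT_TIME=300

    # Slurm 15.08+: stage the binary into each node's in-memory /tmp, then
    # exec the local copy instead of pulling it from the shared filesystem
    srun --bcast=/tmp/a.out ./mpi/a.out

    # Slurm 16.05+: compress the broadcast to approach aprun staging times
    # srun --bcast=/tmp/a.out --compress ./mpi/a.out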
"NERSC users run applications
at every scale to conduct their
Scheduling research."
Business Hours
regular
debug
nid00008 nid05586
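To make the picture concrete, the split could be expressed in slurm.conf roughly as below; the node ranges and limits are assumptions for illustration, the slide does not give the real configuration:

    # slurm.conf (sketch) - a small "debug" slice alongside the full-machine
    # "regular" partition; node ranges and limits are illustrative only
    PartitionName=debug   Nodes=nid[00008-00071] MaxTime=00:30:00 Default=NO
    PartitionName=regular Nodes=nid[00008-05586] MaxTime=48:00:00 Default=YES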
Scheduling - Backfill
● NERSC typically has hundreds of running jobs (thousands on cori)
● Queue frequently 10x larger (2,000 - 10,000 eligible jobs)
● Much parameter optimization required to get things "working"
○ bf_interval
○ bf_max_job_part
○ bf_max_job_user
○ …
● We still weren't getting our target utilization (>95%)
● Still were having long waits with many backfill targets in the queue

New Backfill Algorithm! bf_min_prio_reserve
1. Choose a particular priority value as threshold
2. Everything above the threshold gets resource reservations
3. Everything below is evaluated with a simple "start now" check (NEW for SLURM)
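A sketch of how these knobs sit in slurm.conf; the specific values below are illustrative assumptions, not NERSC's production settings:

    # slurm.conf (sketch) - backfill tuning knobs; all values are illustrative
    SchedulerType=sched/backfill
    SchedulerParameters=bf_interval=30,bf_max_job_part=50,bf_max_job_user=10,bf_min_prio_reserve=1000000
    # Jobs with priority >= bf_min_prio_reserve get forward resource reservations;
    # everything below the threshold only gets the cheap "can it start now?" test.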
Utilization jumped on average more than 7% per day
Every backfill opportunity is realized

Job Prioritization
1. QOS
2. Aging (scaled to 1 point per minute)
3. Fairshare (up to 1440 points)
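One way those three factors could map onto Slurm's multifactor priority plugin (a sketch; the weights are assumptions chosen to match the points above, not NERSC's published values):

    # slurm.conf (sketch) - multifactor priority; weights are illustrative
    PriorityType=priority/multifactor
    PriorityWeightQOS=1000000      # QOS dominates the ranking
    PriorityMaxAge=14-0            # age factor saturates at 14 days
    PriorityWeightAge=20160        # 14 days = 20160 minutes -> ~1 point per minute
    PriorityWeightFairshare=1440   # fairshare contributes up to 1440 points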
Primary Difficulty Faced
[Figure: slurmctld invoking xtcleanup_after, which runs xtcheckhealth across every node of a completed allocation, needs to become slurmctld running xtcheckhealth per node]
Issue is that a "completing" node, stuck
...
xt
...
on unkillable process (or other similar
ch
ec
issue), becomes an emergency
kh
ea
lt
h
NHC doesn't run until the entire allocation has ended. In the case of a slow-to-complete node, this holds large allocations idle. If NHC is run from the per-node epilog, each node can complete independently, returning nodes to service faster.
Exciting slurm topics I'm not covering today
● user training and tutorials
● Our typical experience is that bugs …
● New processor requiring new NUMA …
● Brian Gilmer