Slurm. Our Way.: Douglas Jacobsen, James Botts, Helen He (NERSC)

NERSC operates two supercomputers, Edison and Cori, that are used by over 7,750 active users running over 700 codes. Edison has over 5,500 nodes and Cori has over 1,600 nodes. NERSC uses the Slurm workload manager to schedule jobs on these systems. Native Slurm is used to enable direct support for serial jobs, simplify operations, and enable new features. NERSC patches Slurm often and has developed techniques to upgrade Slurm without rebooting the entire supercomputer. Scaling jobs beyond roughly 50,000 MPI ranks requires techniques such as broadcasting executables to compute nodes due to filesystem limitations. Cori's partitions include realtime, shared, burst buffer, debug, and regular.

SLURM.

Our Way.
Douglas Jacobsen, James Botts, Helen He
NERSC
CUG 2016
NERSC Vital Statistics
● 860 active projects
○ DOE selects projects and PIs, allocates most of our computer time
● 7750 active users
● 700+ codes both established and in-development
● edison XC30, 5586 ivybridge nodes
○ Primarily used for large capability jobs
○ Small to midrange jobs as well
○ Moved edison from Oakland, CA to Berkeley, CA in Dec 2015
● cori phase 1 XC40, 1628 haswell nodes
○ DataWarp
○ realtime jobs for experimental facilities
○ massive quantities of serial jobs
○ regular workload too
○ shifter

Native SLURM at NERSC

Why native?
1. Enables direct support for serial jobs
2. Simplifies operation by easing prolog/epilog access to compute nodes
3. Simplifies user experience
   a. No shared batch-script nodes
   b. Similar to other cluster systems
4. Enables new features and functionality on existing systems
5. Creates a "platform for innovation"

[Diagram: basic CLE 5.2 deployment. Primary and backup slurmctld daemons run on repurposed "net" nodes, alongside slurmdbd backed by mysql, with ldap and rsip services, eslogin nodes, and compute nodes running slurmd. The eslogin view of slurm.conf (/opt/slurm/default) has ControlAddr overridden to force slurmctld traffic over the ethernet interface; the compute view (/dsl/opt/slurm/default) leaves ControlAddr unset so traffic uses ipogif0, owing to lookup of the nid0xxxx hostname.]
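
A minimal sketch of how the two slurm.conf views described above could differ (hostnames and addresses are hypothetical; only the parameters named on the slide are shown):

    # slurm.conf as seen from eslogin nodes via /opt/slurm/default (illustrative)
    ControlMachine=nid00256                # hypothetical controller hostname
    ControlAddr=cori-slurmctl.nersc.gov    # force slurmctld traffic over the ethernet interface

    # slurm.conf as seen from compute nodes via /dsl/opt/slurm/default (illustrative)
    ControlMachine=nid00256
    # ControlAddr left unset: the nid0xxxx hostname resolves to the ipogif0 address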
Challenge: Upgrade native SLURM

Issue: SLURM is installed to /dsl/opt/slurm/<version>, with a symlink to "default".
Changing the symlink can have little impact on the actual version "pointed to" on
compute nodes.
Result: We often receive the recommendation to reboot the supercomputer after
upgrading.

Challenge: NERSC patches SLURM often and is not interested in rebooting.
Issue: The /dsl DVS mount attribute cache prevents proper dereference of the
"default" symlink.
Solution: Mount /dsl/opt/slurm a second time with a short (15 s) attribute cache.
Result: NERSC can live-upgrade without rebooting.

Original method:
    /opt/slurm/15.08.xx_instTag_20150912xxxx
    /opt/slurm/default -> /etc/alternatives/slurm
    /etc/alternatives/slurm -> /opt/slurm/15.08.xx_...

Production method:
    /opt/slurm/15.08.xx_instTag_20150912xxxx
    /opt/slurm/default -> 15.08.xx_instTag_20150912xxxx

    AND compute node /etc/fstab:
    /opt/slurm /dsl/opt/slurm dvs \
        path=/dsl/opt/slurm,nodename=<dslNidList>, \
        <opts>, attrcache_timeout=15

Also moved the slurm sysconfdir to /opt/slurm/etc, where etc is a symlink to
conf.<rev>, to work around a rare DVS issue.
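
Under the production-method layout above, a live upgrade could then look roughly like the following sketch (the version tag, paths, and restart step are illustrative, not NERSC's exact procedure):

    # stage the new build alongside the old one (version tag hypothetical)
    NEW=15.08.8_instTag_20160501
    rsync -a /tmp/slurm-build/ /opt/slurm/$NEW/
    # atomically repoint the "default" symlink to the new version
    ln -sfn "$NEW" /opt/slurm/default
    # compute nodes see the change once the 15 s DVS attribute cache expires;
    # then restart slurmctld and the slurmd daemons so they pick up the new
    # binaries (Slurm daemons can be restarted while jobs continue to run)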
Scaling Up

Challenge: Small and mid-scale jobs work great! When MPI ranks exceed ~50,000,
users sometimes get:

    Sun Jan 24 04:51:29 2016: [unset]:_pmi_alps_get_apid:alps response not OKAY
    Sun Jan 24 04:51:29 2016: [unset]:_pmi_init:_pmi_alps_init returned -1
    [Sun Jan 24 04:51:30 2016] [c3-0c2s9n3] Fatal error in MPI_Init: Other MPI
    error, error stack:
    MPIR_Init_thread(547):
    MPID_Init(203).......: channel initialization failed
    MPID_Init(584).......: PMI2 init failed: 1
    <repeated ad nauseam for every rank>
    ...

Workaround: Increase the PMI timeout from 60 s to something bigger (in the
application environment): PMI_MMAP_SYNC_WAIT_TIME=300

Problem: srun directly execs the application from its hosting filesystem location,
and the filesystem (Lustre) cannot deliver the application at scale. aprun would
copy the executable to an in-memory filesystem by default.

Solution: New 15.08 srun feature merging sbcast and srun (see the job-script
sketch below):
    srun --bcast=/tmp/a.out ./mpi/a.out
slurm 16.05 adds a --compress option to deliver the executable in a similar time
as aprun.

Other scaling topics:
● srun ports for stdout/err
● rsip port exhaustion
● slurm.conf TreeWidth
● Backfill tuning
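
A minimal sketch of a batch script combining both mitigations (node and rank counts are illustrative):

    #!/bin/bash
    #SBATCH -N 2048
    #SBATCH -t 01:00:00
    # give PMI extra time to synchronize at very large rank counts
    export PMI_MMAP_SYNC_WAIT_TIME=300
    # broadcast the executable to node-local /tmp before launching, instead of
    # having every rank exec it from the shared filesystem
    srun -n 65536 --bcast=/tmp/a.out ./mpi/a.out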
"NERSC users run applications
at every scale to conduct their
Scheduling research."

Source: Brian Austin, NERSC



Scheduling

edison
● big job metric - need to always be running at least one "large" job (>682 nodes)
  ○ Give priority boost + discount

cori
● "shared" partition
  ○ Up to 32 jobs per node
  ○ HINT: set --gres=craynetwork:0 in job_submit.lua for shared jobs
  ○ allow users to submit 10,000 jobs with up to 1,000 concurrently running (see
    the QOS sketch after this list)
● "realtime" partition
  ○ Jobs must start within 2 minutes
  ○ Per-project limits implemented using QOS
  ○ Top priority jobs + exclusive access to a small number of nodes (92% utilized)
● burstbuffer QOS gives a constant priority boost to burst buffer jobs

cori+edison
● debug partition
  ○ delivers debug-exclusive nodes
  ○ more exclusive nodes during business hours
● regular partition
  ○ Highly utilized workhorse
● low and premium QOS
  ○ accessible in most partitions
● scavenger QOS
  ○ Once a user account balance drops below zero, all jobs are automatically put
    into scavenger. Eligible for all partitions except realtime.
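
As referenced above, a minimal sketch of expressing such submit/run limits as QOS limits with sacctmgr (the QOS name and the per-user scope are assumptions; only the 10,000 / 1,000 numbers come from the slide):

    # cap queued and concurrently running jobs for the shared workload
    sacctmgr modify qos name=shared set MaxSubmitJobsPerUser=10000 MaxJobsPerUser=1000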
Scheduling - How Debug Works

[Diagram: nodes nid00008 through nid05586 split between the regular and debug
partitions, with debug holding more exclusive nodes during business hours than
on nights and weekends.]

Debug jobs:
● are smaller than "regular" jobs
● are shorter than "regular" jobs
● have access to all nodes in the system
● have advantageous priority

Day/Night:
● a cron-run script manipulates the regular partition configuration
  (scontrol update partition=regular …); a sketch follows below
● during night mode, it adds a reservation to prevent long-running jobs from
  starting on contended nodes

These concepts are extended for cori's realtime and shared partitions.
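
A minimal sketch of what such a day/night script could do (node lists, times, and the reservation name are hypothetical; this is not NERSC's actual script):

    #!/bin/bash
    # toggle the regular partition between day and night mode (values illustrative)
    case "$1" in
      day)
        # business hours: give debug more exclusive nodes
        scontrol update PartitionName=regular Nodes=nid0[0008-5200]
        scontrol delete ReservationName=debug_day 2>/dev/null
        ;;
      night)
        # nights/weekends: return most nodes to regular...
        scontrol update PartitionName=regular Nodes=nid0[0008-5500]
        # ...but reserve the contended nodes for the coming business day, so
        # long-running jobs cannot start on them and hold them past the morning
        scontrol create reservation ReservationName=debug_day users=root \
            starttime=$(date -d 'tomorrow 08:00' +%Y-%m-%dT%H:%M) \
            duration=600 nodes=nid0[5201-5500]
        ;;
    esac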

Scheduling - Backfill

[Diagram: running and queued jobs plotted against time, starting from "now".]

● NERSC typically has hundreds of running jobs (thousands on cori)
● The queue is frequently 10x larger (2,000 - 10,000 eligible jobs)
● Much parameter optimization required to get things "working" (see the
  slurm.conf sketch below)
  ○ bf_interval
  ○ bf_max_job_partition
  ○ bf_max_job_user
  ○ …
● We still weren't getting our target utilization (>95%)
● We were still having long waits with many backfill targets in the queue

New Backfill Algorithm: bf_min_prio_reserve
1. Choose a particular priority value as the threshold
2. Everything above the threshold gets resource reservations
3. Everything below is evaluated with a simple "start now" check (NEW for SLURM)

Result: utilization jumped on average more than 7% per day, and every backfill
opportunity is realized.
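
A sketch of how these knobs appear in slurm.conf (values are illustrative; the slide does not give NERSC's actual settings):

    # slurm.conf excerpt (illustrative values, not NERSC's production settings)
    SchedulerType=sched/backfill
    SchedulerParameters=bf_interval=60,bf_max_job_partition=200,bf_max_job_user=20,bf_min_prio_reserve=100000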
Job Prioritization
1. QOS
2. Aging (scaled to 1 point per minute)
3. Fairshare (up to 1440 points)
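
One possible mapping of these three factors onto Slurm's multifactor priority plugin, with weights chosen so that age accrues roughly one point per minute over a day and fairshare tops out at 1440 points (illustrative values only; the slide does not show NERSC's actual configuration):

    # slurm.conf excerpt (illustrative, not NERSC's production values)
    PriorityType=priority/multifactor
    PriorityWeightQOS=1000000      # QOS dominates the ranking
    PriorityMaxAge=1-00:00:00      # age factor saturates after one day...
    PriorityWeightAge=1440         # ...so a waiting job gains about 1 point per minute
    PriorityWeightFairshare=1440   # fairshare contributes up to 1440 points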
Primary Difficulty Faced

[Diagram: today, slurmctld triggers xtcleanup_after, which runs xtcheckhealth
across the whole allocation once the job ends; this needs to become independent
per-node xtcheckhealth runs reporting back to slurmctld.]

Issue: a "completing" node that is stuck on an unkillable process (or a similar
issue) becomes an emergency.

Node health checking (NHC) doesn't run until the entire allocation has ended, so
a slow-to-complete node holds large allocations idle. If NHC is run from a
per-node epilog instead, each node can complete independently, returning nodes to
service faster (see the sketch below).
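
A minimal sketch of the per-node epilog approach (the epilog path, the xtcheckhealth invocation, and the drain-on-failure handling are illustrative assumptions, not NERSC's actual configuration):

    # slurm.conf excerpt: run an epilog on every compute node of each job
    Epilog=/opt/slurm/etc/epilog.sh          # path illustrative

and the epilog script itself:

    #!/bin/bash
    # epilog.sh (sketch): check only this node's health as its part of the job ends,
    # so a healthy node returns to service without waiting for the whole allocation
    if ! /opt/cray/nodehealth/default/bin/xtcheckhealth; then   # invocation assumed
        # keep a suspect node out of service without blocking the rest of the job
        scontrol update NodeName="$(hostname)" State=DRAIN Reason="NHC failed in epilog"
    fi
    exit 0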
Exciting slurm topics I'm not covering today
● user training and tutorials
● user experience and documentation
● my speculations about Rhine/Redwood
● details of realtime implementation
● burstbuffer / DataWarp integration
● ccm
● reservations
● monitoring
● accounting / integrating slurmdbd with NERSC databases
● draining dvs service nodes with prolog
● blowing up slurm without getting burned
● NERSC slurm plugins: vtune, blcr, shifter, completion
● job_submit.lua
● knl
Conclusions and Future Directions
● We have consistently delivered highly usable systems with SLURM since it was
  put on the systems
● Our typical experience is that bugs are repaired same-or-next day
● Native SLURM is a new technology that has rough edges with great opportunity!
● Increasing resolution of binding affinities
● Integrating Cori Phase 2 (+9300 KNL)
  ○ 11,000 node system
  ○ New processor requiring new NUMA binding capabilities, node reboot
    capabilities, …
● Deploying SLURM on Rhine/Redwood
  ○ Continuous delivery of configurations
  ○ Live rebuild/redeploy (less frequent)
● Scaling topologically aware scheduling
Acknowledgements
NERSC
● Tina Declerck
● Ian Nascimento
● Stephen Leak
Cray
● Brian Gilmer
SchedMD
● Moe Jette
● Danny Auble
● Tim Wickberg
● Brian Christiansen