Slurm. Our Way.: Douglas Jacobsen, James Botts, Helen He (NERSC)

NERSC operates two supercomputers, Edison and Cori, that are used by over 7,750 active users running over 700 codes. Edison has over 5,500 nodes and Cori has over 1,600 nodes. NERSC uses the Slurm workload manager to schedule jobs on these systems. Native Slurm is used to enable direct support for serial jobs, simplify operations, and enable new features. NERSC patches Slurm often and has developed techniques to upgrade Slurm without rebooting the entire supercomputer. Scaling jobs beyond roughly 50,000 MPI ranks requires techniques such as broadcasting executables to compute nodes due to filesystem limitations. Cori's partitions include realtime, shared, burst buffer, debug, and regular.

SLURM.

Our Way.
Douglas Jacobsen, James Botts, Helen He
NERSC
CUG 2016
NERSC Vital Statistics
● 860 active projects
○ DOE selects projects and PIs, allocates most of our computer time
● 7750 active users
● 700+ codes both established and in-development
● edison XC30, 5586 ivybridge nodes
○ Primarily used for large capability jobs
○ Small to midrange jobs as well
○ Moved edison from Oakland, CA to Berkeley, CA in Dec 2015
● cori phase 1 XC40, 1628 haswell nodes
○ DataWarp
○ realtime jobs for experimental facilities
○ massive quantities of serial jobs
○ regular workload too
○ shifter

Native SLURM at NERSC

Why native?
1. Enables direct support for serial jobs
2. Simplifies operation by easing prolog/epilog access to compute nodes
3. Simplifies user experience
   a. No shared batch-script nodes
   b. Similar to other cluster systems
4. Enables new features and functionality on existing systems
5. Creates a "platform for innovation"

[Diagram: basic CLE 5.2 deployment. Primary and backup slurmctld daemons run on repurposed "net" nodes, alongside slurmdbd backed by mysql, with ldap and rsip services, eslogin nodes, and compute nodes running slurmd. The eslogin view of slurm.conf (/opt/slurm/default) has ControlAddr overridden to force slurmctld traffic over the ethernet interface; the compute view (/dsl/opt/slurm/default) leaves ControlAddr unset so traffic uses ipogif0, owing to lookup of the nid0xxxx hostname.]
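
A minimal sketch of how the two slurm.conf views described above could differ (hostnames and addresses are hypothetical; only the parameters named on the slide are shown):

    # slurm.conf as seen from eslogin nodes via /opt/slurm/default (illustrative)
    ControlMachine=nid00256                # hypothetical controller hostname
    ControlAddr=cori-slurmctl.nersc.gov    # force slurmctld traffic over the ethernet interface

    # slurm.conf as seen from compute nodes via /dsl/opt/slurm/default (illustrative)
    ControlMachine=nid00256
    # ControlAddr left unset: the nid0xxxx hostname resolves to the ipogif0 address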
Challenge: Upgrade native SLURM

Issue: SLURM is installed to /dsl/opt/slurm/<version>, with a symlink to "default".
Changing the symlink can have little impact on the actual version "pointed to" on
compute nodes.
Result: We often receive the recommendation to reboot the supercomputer after
upgrading.

Challenge: NERSC patches SLURM often and is not interested in rebooting.
Issue: The /dsl DVS mount attribute cache prevents proper dereference of the
"default" symlink.
Solution: Mount /dsl/opt/slurm a second time with a short (15 s) attribute cache.
Result: NERSC can live-upgrade without rebooting.

Original method:
    /opt/slurm/15.08.xx_instTag_20150912xxxx
    /opt/slurm/default -> /etc/alternatives/slurm
    /etc/alternatives/slurm -> /opt/slurm/15.08.xx_...

Production method:
    /opt/slurm/15.08.xx_instTag_20150912xxxx
    /opt/slurm/default -> 15.08.xx_instTag_20150912xxxx

    AND compute node /etc/fstab:
    /opt/slurm /dsl/opt/slurm dvs \
        path=/dsl/opt/slurm,nodename=<dslNidList>, \
        <opts>, attrcache_timeout=15

Also moved the slurm sysconfdir to /opt/slurm/etc, where etc is a symlink to
conf.<rev>, to work around a rare DVS issue.
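
Under the production-method layout above, a live upgrade could then look roughly like the following sketch (the version tag, paths, and restart step are illustrative, not NERSC's exact procedure):

    # stage the new build alongside the old one (version tag hypothetical)
    NEW=15.08.8_instTag_20160501
    rsync -a /tmp/slurm-build/ /opt/slurm/$NEW/
    # atomically repoint the "default" symlink to the new version
    ln -sfn "$NEW" /opt/slurm/default
    # compute nodes see the change once the 15 s DVS attribute cache expires;
    # then restart slurmctld and the slurmd daemons so they pick up the new
    # binaries (Slurm daemons can be restarted while jobs continue to run)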
Scaling Up

Challenge: Small and mid-scale jobs work great! When MPI ranks exceed ~50,000,
users sometimes get:

    Sun Jan 24 04:51:29 2016: [unset]:_pmi_alps_get_apid:alps response not OKAY
    Sun Jan 24 04:51:29 2016: [unset]:_pmi_init:_pmi_alps_init returned -1
    [Sun Jan 24 04:51:30 2016] [c3-0c2s9n3] Fatal error in MPI_Init: Other MPI
    error, error stack:
    MPIR_Init_thread(547):
    MPID_Init(203).......: channel initialization failed
    MPID_Init(584).......: PMI2 init failed: 1
    <repeated ad nauseam for every rank>
    ...

Workaround: Increase the PMI timeout from 60 s to something bigger (in the
application environment): PMI_MMAP_SYNC_WAIT_TIME=300

Problem: srun directly execs the application from its hosting filesystem location,
and the filesystem (Lustre) cannot deliver the application at scale. aprun would
copy the executable to an in-memory filesystem by default.

Solution: New 15.08 srun feature merging sbcast and srun (see the job-script
sketch below):
    srun --bcast=/tmp/a.out ./mpi/a.out
slurm 16.05 adds a --compress option to deliver the executable in a similar time
as aprun.

Other scaling topics:
● srun ports for stdout/err
● rsip port exhaustion
● slurm.conf TreeWidth
● Backfill tuning
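
A minimal sketch of a batch script combining both mitigations (node and rank counts are illustrative):

    #!/bin/bash
    #SBATCH -N 2048
    #SBATCH -t 01:00:00
    # give PMI extra time to synchronize at very large rank counts
    export PMI_MMAP_SYNC_WAIT_TIME=300
    # broadcast the executable to node-local /tmp before launching, instead of
    # having every rank exec it from the shared filesystem
    srun -n 65536 --bcast=/tmp/a.out ./mpi/a.out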
"NERSC users run applications
at every scale to conduct their
Scheduling research."

Source: Brian Austin, NERSC



Scheduling

edison
● big job metric - need to always be running at least one "large" job (>682 nodes)
  ○ Give priority boost + discount

cori
● "shared" partition
  ○ Up to 32 jobs per node
  ○ HINT: set --gres=craynetwork:0 in job_submit.lua for shared jobs
  ○ allow users to submit 10,000 jobs with up to 1,000 concurrently running (see
    the QOS sketch after this list)
● "realtime" partition
  ○ Jobs must start within 2 minutes
  ○ Per-project limits implemented using QOS
  ○ Top priority jobs + exclusive access to a small number of nodes (92% utilized)
● burstbuffer QOS gives a constant priority boost to burst buffer jobs

cori+edison
● debug partition
  ○ delivers debug-exclusive nodes
  ○ more exclusive nodes during business hours
● regular partition
  ○ Highly utilized workhorse
● low and premium QOS
  ○ accessible in most partitions
● scavenger QOS
  ○ Once a user account balance drops below zero, all jobs are automatically put
    into scavenger. Eligible for all partitions except realtime.
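
As referenced above, a minimal sketch of expressing such submit/run limits as QOS limits with sacctmgr (the QOS name and the per-user scope are assumptions; only the 10,000 / 1,000 numbers come from the slide):

    # cap queued and concurrently running jobs for the shared workload
    sacctmgr modify qos name=shared set MaxSubmitJobsPerUser=10000 MaxJobsPerUser=1000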
Scheduling - How Debug Works

[Diagram: nodes nid00008 through nid05586 split between the regular and debug
partitions, with debug holding more exclusive nodes during business hours than
on nights and weekends.]

Debug jobs:
● are smaller than "regular" jobs
● are shorter than "regular" jobs
● have access to all nodes in the system
● have advantageous priority

Day/Night:
● a cron-run script manipulates the regular partition configuration
  (scontrol update partition=regular …); a sketch follows below
● during night mode, it adds a reservation to prevent long-running jobs from
  starting on contended nodes

These concepts are extended for cori's realtime and shared partitions.
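
A minimal sketch of what such a day/night script could do (node lists, times, and the reservation name are hypothetical; this is not NERSC's actual script):

    #!/bin/bash
    # toggle the regular partition between day and night mode (values illustrative)
    case "$1" in
      day)
        # business hours: give debug more exclusive nodes
        scontrol update PartitionName=regular Nodes=nid0[0008-5200]
        scontrol delete ReservationName=debug_day 2>/dev/null
        ;;
      night)
        # nights/weekends: return most nodes to regular...
        scontrol update PartitionName=regular Nodes=nid0[0008-5500]
        # ...but reserve the contended nodes for the coming business day, so
        # long-running jobs cannot start on them and hold them past the morning
        scontrol create reservation ReservationName=debug_day users=root \
            starttime=$(date -d 'tomorrow 08:00' +%Y-%m-%dT%H:%M) \
            duration=600 nodes=nid0[5201-5500]
        ;;
    esac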

Scheduling - Backfill

[Diagram: running and queued jobs plotted against time, starting from "now".]

● NERSC typically has hundreds of running jobs (thousands on cori)
● The queue is frequently 10x larger (2,000 - 10,000 eligible jobs)
● Much parameter optimization required to get things "working" (see the
  slurm.conf sketch below)
  ○ bf_interval
  ○ bf_max_job_partition
  ○ bf_max_job_user
  ○ …
● We still weren't getting our target utilization (>95%)
● We were still having long waits with many backfill targets in the queue

New Backfill Algorithm: bf_min_prio_reserve
1. Choose a particular priority value as the threshold
2. Everything above the threshold gets resource reservations
3. Everything below is evaluated with a simple "start now" check (NEW for SLURM)

Result: utilization jumped on average more than 7% per day, and every backfill
opportunity is realized.
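
A sketch of how these knobs appear in slurm.conf (values are illustrative; the slide does not give NERSC's actual settings):

    # slurm.conf excerpt (illustrative values, not NERSC's production settings)
    SchedulerType=sched/backfill
    SchedulerParameters=bf_interval=60,bf_max_job_partition=200,bf_max_job_user=20,bf_min_prio_reserve=100000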
Job Prioritization
1. QOS
2. Aging (scaled to 1 point per minute)
3. Fairshare (up to 1440 points)
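
One possible mapping of these three factors onto Slurm's multifactor priority plugin, with weights chosen so that age accrues roughly one point per minute over a day and fairshare tops out at 1440 points (illustrative values only; the slide does not show NERSC's actual configuration):

    # slurm.conf excerpt (illustrative, not NERSC's production values)
    PriorityType=priority/multifactor
    PriorityWeightQOS=1000000      # QOS dominates the ranking
    PriorityMaxAge=1-00:00:00      # age factor saturates after one day...
    PriorityWeightAge=1440         # ...so a waiting job gains about 1 point per minute
    PriorityWeightFairshare=1440   # fairshare contributes up to 1440 points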
Primary Difficulty Faced

[Diagram: today, slurmctld triggers xtcleanup_after, which runs xtcheckhealth
across the whole allocation once the job ends; this needs to become independent
per-node xtcheckhealth runs reporting back to slurmctld.]

Issue: a "completing" node that is stuck on an unkillable process (or a similar
issue) becomes an emergency.

Node health checking (NHC) doesn't run until the entire allocation has ended, so
a slow-to-complete node holds large allocations idle. If NHC is run from a
per-node epilog instead, each node can complete independently, returning nodes to
service faster (see the sketch below).
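
A minimal sketch of the per-node epilog approach (the epilog path, the xtcheckhealth invocation, and the drain-on-failure handling are illustrative assumptions, not NERSC's actual configuration):

    # slurm.conf excerpt: run an epilog on every compute node of each job
    Epilog=/opt/slurm/etc/epilog.sh          # path illustrative

and the epilog script itself:

    #!/bin/bash
    # epilog.sh (sketch): check only this node's health as its part of the job ends,
    # so a healthy node returns to service without waiting for the whole allocation
    if ! /opt/cray/nodehealth/default/bin/xtcheckhealth; then   # invocation assumed
        # keep a suspect node out of service without blocking the rest of the job
        scontrol update NodeName="$(hostname)" State=DRAIN Reason="NHC failed in epilog"
    fi
    exit 0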
Exciting slurm topics I'm not covering today
● user training and tutorials
● user experience and documentation
● my speculations about Rhine/Redwood
● details of realtime implementation
● burstbuffer / DataWarp integration
● ccm
● reservations
● monitoring
● accounting / integrating slurmdbd with NERSC databases
● draining dvs service nodes with prolog
● blowing up slurm without getting burned
● NERSC slurm plugins: vtune, blcr, shifter, completion
● job_submit.lua
● knl
Conclusions and Future Directions
● We have consistently delivered highly usable systems with SLURM since it was
  put on the systems
● Our typical experience is that bugs are repaired same-or-next day
● Native SLURM is a new technology that has rough edges with great opportunity!
● Increasing resolution of binding affinities
● Integrating Cori Phase 2 (+9300 KNL)
  ○ 11,000 node system
  ○ New processor requiring new NUMA binding capabilities, node reboot
    capabilities, …
● Deploying SLURM on Rhine/Redwood
  ○ Continuous delivery of configurations
  ○ Live rebuild/redeploy (less frequent)
● Scaling topologically aware scheduling
Acknowledgements
NERSC
● Tina Declerck
● Ian Nascimento
● Stephen Leak
Cray
● Brian Gilmer
SchedMD
● Moe Jette
● Danny Auble
● Tim Wickberg
● Brian Christiansen