Slurm Usage Guide

Concept

SSH flow: first connect to hanoi, then connect to login-sp.vinai-systems.com.

Log in with your AD account.

ssh hanoi
ssh <username>@login-sp.vinai-systems.com
Ex: ssh [email protected]

HOME_FOLDER_ISILON <=> /home/your_username (on login node) <=> /vinai/your_username

SUPERPOD_STORAGE_DDN_FOLDER <=> /lustre/scratch/client (on all nodes)

PERSONAL_STORAGE_DDN_FOLDER <=> /lustre/scratch/client/vinai/user/your_username

You have to put your training data in DDN storage; the Isilon HOME folder is intended for
long-term data archiving.
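
For example (the paths are placeholders), copying a dataset from your Isilon home folder to your personal DDN folder could look like this:

rsync -avh /home/your_username/datasets/ /lustre/scratch/client/vinai/user/your_username/datasets/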

Introduction
Slurm is an open-source job-scheduling system for Linux clusters, most frequently used for
high-performance computing (HPC) applications. This guide covers the basics of getting
started with Slurm as a user. For more information, the Slurm docs are a good place to
start.

After Slurm is deployed on a cluster, a slurmd daemon runs on each compute node. Users do not
log directly into each compute node to do their work. Instead, they execute Slurm commands
(e.g. srun, sinfo, scancel, scontrol) from a Slurm login node. These commands communicate with
the slurmd daemons on each host to perform the work.

Simple Commands

Cluster state with sinfo

To "see" the cluster, ssh to the Slurm login node for your cluster and run the `sinfo`
command:
dgxuser@sdc2-hpc-login-mgmt001:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 1-00:00:00 8 idle sdc2-hpc-dgx-a100-[001-008]
batch* up 1-00:00:00 2 down sdc2-hpc-dgx-a100-[013,015]
Eight nodes in this cluster are idle and available; two nodes (013 and 015) are down. If a node
is busy, its state will change from idle to alloc; if a node goes down, its state will change
from idle to down.
dgxuser@sdc2-hpc-login-mgmt001:~$ sinfo -lN
Fri Jul 16 10:47:52 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT
AVAIL_FE REASON
sdc2-hpc-dgx-a100-001 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-002 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-003 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-004 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-005 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-006 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-007 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-008 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-013 1 batch* down 256 2:64:2 103100 0 1 (null) VinAI use
sdc2-hpc-dgx-a100-015 1 batch* down 256 2:64:2 103100 0 1 (null) VinAI use

The `sinfo` command can be used to output a lot more information about the cluster. Check out
the sinfo doc for more information.
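
For instance, the output can be summarized or its columns customized with the -o/--format option (the format string below is only one possible choice):

sinfo -s                              # one summary line per partition
sinfo -o "%P %a %l %D %t %N"          # partition, availability, time limit, node count, state, node list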

Running a job with srun


To run a job, use the srun command:
dgxuser@sdc2-hpc-login-mgmt001:~$ srun --partition=batch --gres=gpu:8 env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
dgxuser@sdc2-hpc-login-mgmt001:~$ srun --partition=batch --ntasks 8 -l hostname
5: sdc2-hpc-dgx-a100-001
2: sdc2-hpc-dgx-a100-001
7: sdc2-hpc-dgx-a100-001
6: sdc2-hpc-dgx-a100-001
0: sdc2-hpc-dgx-a100-001
3: sdc2-hpc-dgx-a100-001
1: sdc2-hpc-dgx-a100-001
4: sdc2-hpc-dgx-a100-001
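
srun can also spread tasks across several nodes; for example, the following (assuming at least two idle nodes in the batch partition) runs one task on each of two nodes and prints their hostnames:

srun --partition=batch --nodes=2 --ntasks-per-node=1 -l hostname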

Running an interactive job


Especially when developing and experimenting, it's helpful to run an interactive job, which
requests a resource and provides a command prompt as an interface to it (the example below
requests a maximum run time of 2 hours):

dgxuser@sdc2-hpc-login-mgmt001:~$ srun --partition=batch --time=02:00:00 --pty /bin/bash


dgxuser@sdc2-hpc-dgx-a100-001:~$ hostname
sdc2-hpc-dgx-a100-001
dgxuser@sdc2-hpc-dgx-a100-001:~$ exit

In interactive mode, the resource stays reserved until the prompt is exited (as shown above),
and commands can be run in succession.
Note: before starting an interactive session with srun, it may be helpful to create a session
on the login node with a tool like tmux or `screen`. This prevents you from losing the
interactive job if there is a network outage or the terminal is closed; see the sketch below.
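A typical workflow (the session name is arbitrary) might look like:

tmux new -s interactive
srun --partition=batch --time=02:00:00 --pty /bin/bash
# ... work inside the interactive job, detach with Ctrl-b d ...
tmux attach -t interactive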
More Advanced Use

Run a batch job


While the srun command blocks the terminal until it finishes, sbatch queues a job for execution
once resources become available in the cluster. A batch job also lets you queue up several jobs
that run as nodes become available. It is therefore good practice to encapsulate everything that
needs to be run in a script and submit it with sbatch rather than srun. Note that #SBATCH
directives must appear before the first executable command in the script.
Example: running a Python job

dgxuser@sdc2-hpc-login-mgmt001:~$ cat script.sh

#!/bin/bash -e
#SBATCH --job-name=demo              # create a short name for your job
#SBATCH --output=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.out   # output file
#SBATCH --error=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.err    # error file
#SBATCH --partition=batch            # choose partition (batch or phase2)
#SBATCH --gpus=1                     # GPU count
#SBATCH --nodes=1                    # node count
#SBATCH --mem-per-cpu=2G             # memory per CPU core (4G is the default)
#SBATCH --cpus-per-gpu=8             # CPU cores per GPU
#SBATCH --mail-type=all              # mail events: begin, end, fail, requeue, all
#SBATCH [email protected]        # your email

python3 demo.py
dgxuser@sdc2-hpc-login-mgmt001:~$ sbatch script.sh

Resources can be requested in several different ways:

sbatch/srun option        Description

-N, --nodes=              Total number of nodes to request
-n, --ntasks=             Total number of tasks to request
--ntasks-per-node=        Number of tasks per node
--gpus-per-node=          Number of GPUs per node
-G, --gpus=               Total number of GPUs to allocate for the job
--gpus-per-task=          Number of GPUs per task
--cpus-per-task=          Number of CPU cores per task
--exclusive               Guarantee that nodes are not shared among jobs
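
For example, a multi-node request that combines several of these options (the values are only illustrative) could look like:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --exclusive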

Observing running jobs with squeue


To see which jobs are running in the cluster, use the `squeue` command:
dgxuser@sdc2-hpc-login-mgmt001:~$ squeue -a -l
Fri Jul 16 11:01:38 2021
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
125 batch demo dgxuser COMPLETI 0:09 1-00:00:00 1 sdc2-hpc-dgx-a100-001
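
squeue can also be filtered, for example to show only your own jobs or a specific job ID:

squeue -u $USER        # jobs belonging to the current user
squeue -j 125          # a specific job ID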

Cancel a job with scancel

First find the job ID with squeue, then cancel the job:

dgxuser@sdc2-hpc-login-mgmt001:~$ squeue
dgxuser@sdc2-hpc-login-mgmt001:~$ scancel JOBID
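
scancel can also cancel jobs by user or by job name, for example:

scancel -u $USER             # cancel all of your own jobs
scancel --name=demo          # cancel jobs with a given job name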

Running a job with modules


List of available modules

dgxuser@sdc2-hpc-login-mgmt001:~$ module avail

------------------------------------------------ /sw/modules/all ------------------------------------------------
mpi/3.0.6 python/2.7.18 python/3.6.10 python/3.8.10 python/miniconda3/miniconda3
python/pytorch/1.9.0+cu111 python/tensorflow/2.3.0

Use "module spider" to find all possible modules.


Use "module keyword key1 key2 ..." to search for all possible modules matching any of the
"keys".

Create your environment


dgxuser@sdc2-hpc-login-mgmt001:~$ module load python/miniconda3/miniconda3
dgxuser@sdc2-hpc-login-mgmt001:~$ conda create -p /lustre/scratch/client/vinai/users/youruser/yourfolder python=yourversion
dgxuser@sdc2-hpc-login-mgmt001:~$ conda activate /lustre/scratch/client/vinai/users/youruser/yourfolder

Because the environment is created with a prefix (-p), activate it by its full path.

Install the libraries and packages you need (pip is preferred). Export the proxy settings if you
have problems with the Internet connection:
export HTTP_PROXY=http://proxytc.vingroup.net:9090/
export HTTPS_PROXY=http://proxytc.vingroup.net:9090/
export http_proxy=http://proxytc.vingroup.net:9090/
export https_proxy=http://proxytc.vingroup.net:9090/
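
With the proxy exported and your environment activated, packages can then be installed as usual (the package names below are only examples):

pip install numpy torch torchvision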
Example: running a job on 1 A100 node with 4 GPUs:

dgxuser@sdc2-hpc-login-mgmt001:~$ cat conda.sh


#!/bin/bash -e
#SBATCH --job-name=py-job
#SBATCH --output=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.out
#SBATCH --error=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.err
#SBATCH --gpus=4
#SBATCH --nodes=1
#SBATCH --mem-per-gpu=36G
#SBATCH --cpus-per-gpu=8
#SBATCH --partition=batch            # choose partition (batch or phase2)
#SBATCH --mail-type=all
#SBATCH [email protected]        # your email

module purge
module load python/miniconda3/miniconda3
eval "$(conda shell.bash hook)"
conda activate /lustre/scratch/client/vinai/users/youruser/yourfolder

command ...
dgxuser@sdc2-hpc-login-mgmt001:~$ sbatch conda.sh

Running a job with a Docker container


List of available containers on harbor.vinai-systems.com

harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.0-cudnn7-ubuntu18.04
harbor.vinai-systems.com/library/cuda:10.0-cudnn7-ubuntu18.04
harbor.vinai-systems.com/library/pytorch:1.4.0-python3.7-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-tensorflow:1.14.0-python3.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-python:3.6-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-tf-torch:1.15.0-1.4.0-python2.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/miniconda:3-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-pytorch:1.4.0-python3.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/miniconda:3-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/pytorch:1.4.0-python3.7-cuda10.0-cudnn7-ubuntu16.04

You can build your own image from an nvcr.io base image; a Dockerfile example is in the attached ZipFile.
On the login node:

docker login harbor.vinai-systems.com    (use your login-node account)
docker tag your_image harbor.vinai-systems.com/library/your_image:your_tag
docker push harbor.vinai-systems.com/library/your_image:your_tag

Contact the admin if you need an account for harbor.vinai-systems.com.

You can run a container job as in the following example:

dgxuser@sdc2-hpc-login-mgmt001:~$ cat container.sh


#!/bin/bash -e
#SBATCH --job-name=container-job
#SBATCH --output=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.out
#SBATCH --error=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.err
#SBATCH --gpus=2
#SBATCH --nodes=1
#SBATCH --mem-per-gpu=36G
#SBATCH --cpus-per-gpu=8
#SBATCH --partition=batch
#SBATCH --mail-type=all
#SBATCH [email protected] //your email
srun --container-image="harbor.vinai-systems.com#library/cuda:10.0-cudnn7-ubuntu18.04" \
--container-mounts=lustre_folder:container_folder \
python …
dgxuser@sdc2-hpc-login-mgmt001:~$ sbatch container.sh

Note: save your checkpoints to your Lustre (DDN) folder.
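
As an illustration (the image is taken from the list above; the mount paths and script name are placeholders), mounting your personal DDN folder into the container and writing checkpoints there might look like:

srun --container-image="harbor.vinai-systems.com#library/pytorch:1.4.0-python3.7-cuda10.1-cudnn7-ubuntu16.04" \
     --container-mounts=/lustre/scratch/client/vinai/users/youruser/yourfolder:/workspace \
     python /workspace/train.py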
