Slurm Usage Guide
Concept
ssh hanoi
ssh <username>@login-sp.vinai-systems.com
Ex: ssh [email protected]
PERSONAL_STORAGE_DDN_FOLDER <=> /lustre/scratch/client/vinai/user/your_username
Put your training data in DDN storage; HOME (ISILON) storage is used for long-term data archiving.
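For example, a dataset can be copied from your home directory into your DDN scratch folder with standard tools (the source path below is only an illustration):
# copy a dataset from HOME (ISILON) to your personal DDN scratch folder
rsync -avP ~/datasets/my_dataset/ /lustre/scratch/client/vinai/user/your_username/my_dataset/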
Introduction
Slurm is an open-source job scheduling system for Linux clusters, most frequently used for
high-performance computing (HPC) applications. This guide will cover some of the basics to
get started using slurm as a user. For more information, the Slurm Docs are a good place to
start.
After slurm is deployed on a cluster, a slurmd daemon should be running on each compute
system. Users do not log directly into each compute system to do their work. Instead, they
execute slurm commands (e.g., srun, sinfo, scancel, scontrol) from a slurm login node.
These commands communicate with the slurmd daemons on each host to perform work.
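For example, from the login node you can inspect the detailed state of a single compute node with scontrol (the node name below matches the sinfo output later in this guide):
# show detailed information about one compute node
scontrol show node sdc2-hpc-dgx-a100-001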
Simple Commands
Cluster state with sinfo
To "see" the cluster, ssh to the slurm login node for your cluster and run the `sinfo`
command:
dgxuser@sdc2-hpc-login-mgmt001:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 1-00:00:00 8 idle sdc2-hpc-dgx-a100-[001-008]
batch* up 1-00:00:00 2 down sdc2-hpc-dgx-a100-[013,015]
There are 8 nodes available on this system, all in an idle state. When a node is allocated to a job, its state changes from idle to alloc; when a node is unavailable, its state shows as down.
dgxuser@sdc2-hpc-login-mgmt001:~$ sinfo -lN
Fri Jul 16 10:47:52 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
sdc2-hpc-dgx-a100-001 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-002 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-003 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-004 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-005 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-006 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-007 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-008 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-013 1 batch* down 256 2:64:2 103100 0 1 (null) VinAI use
sdc2-hpc-dgx-a100-015 1 batch* down 256 2:64:2 103100 0 1 (null) VinAI use
The `sinfo` command can be used to output a lot more information about the cluster. Check out
the sinfo doc for more information.
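Interactive jobs are started with srun. A minimal sketch, assuming the batch partition shown above and a single GPU (adjust the resource flags to your needs):
# request an interactive shell on one node with 1 GPU in the batch partition
srun --partition=batch --nodes=1 --gres=gpu:1 --pty /bin/bash
# ... run commands on the allocated compute node ...
exit    # leaving the shell ends the job and releases the resources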
During an interactive session, the resources remain reserved until the prompt is exited (as shown above), and commands can be run in succession.
Note: before starting an interactive session with srun, it may be helpful to create a session
on the login node with a tool like tmux or screen. This will prevent a user from losing an
interactive job if there is a network outage or the terminal is closed.
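A minimal tmux workflow for this, assuming tmux is available on the login node:
tmux new -s train       # create a named session on the login node
# ... start srun and work interactively inside the tmux session ...
# detach with Ctrl-b d; the session (and any interactive job) keeps running
tmux attach -t train    # reattach later, e.g. after a network drop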
More Advanced Use
Check the job queue with squeue and cancel a job with scancel:
dgxuser@sdc2-hpc-login-mgmt001:~$ squeue
dgxuser@sdc2-hpc-login-mgmt001:~$ scancel JOBID
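For example, to list only your own jobs and cancel one by its ID (the ID below is just an illustration):
squeue -u $USER     # show only your jobs and their job IDs
scancel 12345       # cancel job 12345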
The available environment modules (as listed by module avail) include:
-------------------------------- /sw/modules/all --------------------------------
mpi/3.0.6  python/2.7.18  python/3.6.10  python/3.8.10  python/miniconda3/miniconda3
python/pytorch/1.9.0+cu111  python/tensorflow/2.3.0
Install the libraries and packages you need (pip is preferred). If you have problems with the
internet connection, export the proxy settings:
export HTTP_PROXY=http://proxytc.vingroup.net:9090/
export HTTPS_PROXY=http://proxytc.vingroup.net:9090/
export http_proxy=http://proxytc.vingroup.net:9090/
export https_proxy=http://proxytc.vingroup.net:9090/
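With the proxy variables exported, packages can then be installed as usual; the package name below is only an example:
pip install --user numpy    # example package, installed into your user site-packages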
Example: run a job on 1 A100 node with 4 GPUs. The job script (conda.sh, submitted below) could look like the following; the #SBATCH directives are a minimal sketch and may need adjusting for your cluster:
#!/bin/bash
#SBATCH --nodes=1              # 1 A100 node
#SBATCH --gres=gpu:4           # 4 GPUs

module purge
module load python/miniconda3/miniconda3
eval "$(conda shell.bash hook)"
conda activate /lustre/scratch/client/vinai/users/youruser/yourfolder
command ...                    # replace with your training command
dgxuser@sdc2-hpc-login-mgmt001:~$ sbatch conda.sh
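After submitting, the job can be monitored from the login node; by default sbatch writes the job's output to a slurm-<jobid>.out file in the submission directory:
squeue -u $USER             # check whether the job is pending or running
tail -f slurm-<jobid>.out   # follow the job output (default output file name)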
Available images in the harbor.vinai-systems.com/library registry include:
harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.0-cudnn7-ubuntu18.04
harbor.vinai-systems.com/library/cuda:10.0-cudnn7-ubuntu18.04
harbor.vinai-systems.com/library/pytorch:1.4.0-python3.7-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-tensorflow:1.14.0-python3.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-python:3.6-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-tf-torch:1.15.0-1.4.0-python2.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/miniconda:3-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-pytorch:1.4.0-python3.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/miniconda:3-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/pytorch:1.4.0-python3.7-cuda10.0-cudnn7-ubuntu16.04
You can build your own image from an nvcr.io base image; a Dockerfile example is included in the attached ZIP file.
On the login node:
docker login harbor.vinai-systems.com (login node account)
docker tag your_image harbor.vinai-systems.com/library/your_image:your_tag
docker push harbor.vinai-systems.com/library/your_image:your_tag
Contact the admin if you want an account created for harbor.vinai-systems.com.