
Technische Universität München

Using the Batch Farm


Prologue

• All information + scripts from this talk are also available at:
  A) transfer.ktas.ph.tum.de
  B) /home/www/papers/computing
Overview

• Infrastructure
• Parallel vs single job computing
• Basic commands
• How to…
  … arrange a job
  … send a job
  … monitor my stuff
• Please don’t…
Infrastructure

• 21 compute nodes → 570 cores
• ~2 GB RAM / core
• 20 GPU job slots
• Standard queue: 2.5 h / job
• Long queue: 12 h / job
• Local storage: ~100 GB per node
• 1/10 Gbit/s network connection per node

SLURM job scheduler: https://www.schedmd.com
Parallel vs single job

(Diagram: under “Parallel running”, Jobs 1–9 run on the farm simultaneously; under “Single running”, Jobs 1–3 run one after another.)

Parallel running:
• Independent jobs
• Parameter scans
• MC production
• Data analysis (run-wise)
• Creation of independent output files

Single running:
• Code development
• Compiling
• Creating plots / graphs
• Small nTuple analysis
• Merging of several files
Example: Parallel job
Analysis of Detector Summary Tape (DST) files

Problem
• 1000 files with 250 events/file

Solution
• Create code locally
• Analyse 1 file per job
• Create 1 output file per job (Plots, Ntuples…)
• Send 1000 jobs to farm
• Merge plots/ntuples afterwards
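
A minimal sketch of this recipe, assuming the 1000 input files are listed in a text file (filelist.txt and the analyse binary are hypothetical placeholders; the fully parameterised version appears under “Example: Send 10 jobs” below):

  #!/bin/sh
  # sketch: submit one job per DST file, driven by a file list
  i=0
  while read -r file; do
      i=$((i + 1))
      script=/var/tmp/dst_job_$i.sh
      echo "#!/bin/sh"                      >  $script
      echo "./analyse $file output_$i.root" >> $script
      sbatch --time=150 --mem-per-cpu=100 $script  # standard queue: 2.5 h = 150 min
      rm -f $script
  done < filelist.txt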
Example: Single job
Fitting a peak in a plot

Problem
• Fit peaks in 1 or 2 plots

Solution
• Create a macro / program to fit
• Do it locally and check the output

Don’t make life more complicated than it is!
Basic commands

Here you will get some information about the basic commands. Most of them provide more information; see “<command> --help”. A few typical invocations are sketched after this list.

• sview: SLURM overview. Job, partition and node information in a graphical overview. Just enter “sview” in a terminal.
• sshare: “Fair share” ranking (how fast do I get the slot for the next job?). Just enter “sshare --all” in a terminal.
• sbatch: Submit a job to the farm. Enter “sbatch --help” for info about the parameters (described later).
• scancel: Kill your jobs by id, or all of your jobs using “scancel -u [ADS]”.
• squeue: Gives information about the status of the running jobs and the queue. Just enter “squeue” in a terminal.
• sinfo: Gives information about the nodes, queues and users of the farm. Just enter “sinfo” in a terminal.
• Monitoring software
  – Graphical: a short graphical overview of the users’ currently running jobs on the farm: https://transfer.ktas.ph.tum.de/django/monitor/1/
  – Text based: a short text-based overview of the users’ currently running jobs on the farm: https://transfer.ktas.ph.tum.de/webpage/monitoring_batchfarm.html
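
A few typical invocations of these commands (a short sketch; the job id 12345 is a placeholder):

  squeue -u $USER    # list only my jobs
  sshare --all       # fair-share ranking of all accounts
  sinfo              # node / partition status
  scancel 12345      # kill one job by id
  scancel -u $USER   # kill all of my jobs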
How to… arrange a job

• Input:
  – File to analyse? (File list?)
  – Parameters?
• Output:
  – Different names / directories
• Compile before sending to the farm
• How much CPU time / RAM?
• Do I need temporary space?
• Do I need access to /scratch?
• Check everything before going to the farm (see the sketch below)
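
One way to estimate the CPU-time and RAM numbers before submitting is a timed local run (a sketch assuming GNU time is installed; test_output.txt is a placeholder name):

  # run once locally; "Elapsed (wall clock) time" and
  # "Maximum resident set size" give estimates for --time and --mem-per-cpu
  /usr/bin/time -v ./Example test_output.txt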
Example: Random Numbers

Problem
• Create file with 10 different lines and random
numbers
• Must be scalable to farm

Solution
• Input: the name of the output file has to be given
• Compile
• “Full program”
  – Example.cc
  – Makefile
→ This generates an executable program
Example: Random Numbers

(Slides showing the Example.cc source and the Makefile; the code is available at /home/www/papers/computing.)
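
The C++ listing itself is not reproduced here; as a rough stand-in, a shell script with the same behaviour might look like this (a sketch only, not the actual Example.cc; $RANDOM requires bash):

  #!/bin/bash
  # stand-in for the compiled Example program:
  # write 10 lines of random numbers to the file named as first argument
  out=$1
  : > "$out"                       # create / truncate the output file
  for i in `seq 1 10`; do
      echo "$i $RANDOM" >> "$out"  # one random number per line
  done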

• Run it locally to check if it works


How to… send a job

• Select your parameters:
  – CPU
  – RAM
  – Partition
• SLURM can only submit scripts
• Loop over all the jobs you want to submit
• Create a bash / python script
• Example:
  – Create a script with a submit loop (submit.sh)
  – Inside, create a temporary script with your job inside
  – Run your script
Example: Send 10 jobs

#!/bin/sh
cpu=5    # time limit in minutes for your job,
         # it will be killed after that time!
mem=100  # RAM limit in MB for your job,
         # it will be killed if it exceeds this
nJobs=10 # number of jobs to be performed

# the program is defined here
program=/home/www/papers/computing/programs/Example
name=Example

# the output parameters are defined here
output_path=/home/www/papers/computing/testoutput
output_name=Event
output_end=txt
# generate a random number to identify this submission’s files uniquely
randomID=$RANDOM

for i in `seq 1 $nJobs`; do

    tmp_scriptname=/var/tmp/sub_${randomID}_$i.sh

    # set your default environment
    echo "#!/bin/sh" > $tmp_scriptname
    echo ". ~/.bashrc" >> $tmp_scriptname

    # run your program, writing its output to the local disk
    echo "${program} /var/tmp/local_${randomID}_$i.txt" >> $tmp_scriptname

    # copy the completed output to your location
    echo "cp /var/tmp/local_${randomID}_$i.txt ${output_path}/${output_name}-$i.${output_end}" >> $tmp_scriptname

    # clean up your stuff
    echo "rm /var/tmp/local_${randomID}_$i.txt" >> $tmp_scriptname

    # submit your temporary script to the farm
    sbatch --mem-per-cpu=${mem} --time=${cpu} --job-name=${name}-$i ${tmp_scriptname}

    # delete your temporary script
    rm -f ${tmp_scriptname}

done
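
Once all jobs have finished, the per-job outputs can be combined, matching the “merge afterwards” step of the parallel-job recipe (a minimal sketch using the same variables as submit.sh; merged.txt is a placeholder name):

  # concatenate the 10 per-job text files into one
  cat ${output_path}/${output_name}-*.${output_end} > ${output_path}/merged.txt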
How to… monitor my stuff

• Check your jobs frequently (squeue…)
  – Do they disappear suddenly?
  – Do they finish too quickly?

• Check the log files in case of problems (see the sketch below)
  – What is written there?
  – Does the problem depend on one machine?

• Try to run a job locally
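
Unless redirected, sbatch writes each job’s stdout and stderr to a file named slurm-<jobid>.out in the submission directory, so a quick log check might look like this (12345 is a placeholder job id):

  tail slurm-12345.out             # see how far the job got
  grep -i error slurm-12345.out    # scan the log for error messages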


Error handling

• Have you checked the logfile?
• Are your scripts and code valid?
• Is your data available?
• Is the fileserver present, or under heavy usage?
• Do your jobs last unusually long?

Don’t call an admin without having checked all points!
Important Notes

Some important notes:

• Don’t use /tmp; use /var/tmp
• Don’t write directly to /scratch; copy at the end of the job
• Clean up after your job (see the sketch below)
• Try to stay under 50k jobs at one time
• Adjust your CPU and RAM requests reasonably
• Always check your work
• Be friendly to the others 
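
For the clean-up rule, one common shell pattern is an EXIT trap in the generated job script, so the local temporary file is removed whenever the script exits, even if the program fails partway (a sketch, not from the slides; it reuses the program and paths from the earlier example, and Event-1.txt is a placeholder output name):

  #!/bin/sh
  # hypothetical job script with automatic clean-up
  tmpfile=/var/tmp/local_$$.txt
  trap 'rm -f "$tmpfile"' EXIT   # runs on any normal script exit

  /home/www/papers/computing/programs/Example "$tmpfile"
  cp "$tmpfile" /home/www/papers/computing/testoutput/Event-1.txt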


Questions?
