Using the BMRC Cluster

Introduction

The BMRC Cluster is made up of a small number of login nodes and a larger number of compute nodes. Users don't run their code directly on the compute nodes but instead submit jobs to a job scheduler, which distributes those jobs to the compute nodes. The BMRC Cluster uses the Slurm job scheduler.

Please note that the login nodes are only a small part of the BMRC Cluster. You should not run compute- or memory-intensive processes on the login nodes as doing so impacts all users of the cluster. Such work should instead be submitted to the job scheduler for execution on the compute nodes.

Interactive and non-interactive cluster use: srun and sbatch

There are two different but complementary ways of getting your code running on the cluster. The srun command can be used to give an interactive bash shell running on a compute node. The sbatch command can be used to run an application or script non-interactively on a compute node.

Prerequisites:

You will need a cluster account.
You will need to know how to log in.
You will need to know which scheduler accounts you have access to (please ask your PI/group administrator).

SRUN

Using the srun command you can request an interactive cluster session in which you are given a live terminal (aka shell) on one of the cluster compute nodes. An interactive cluster session allows you to type commands and run code live from the terminal, just as you would from cluster1-2, but with the benefit to yourself of giving you dedicated resources and with the benefit to others that your code is not running on the login nodes.

To request a session, run:

srun -p short --pty bash

The output you see on screen will look as follows:

[username@rescomp1 ~]$ srun -p short --pty bash
srun: job 84496 queued and waiting for resources
srun: job 84496 has been allocated resources
[username@compe028 ~]$

In the above output, you can read that your srun session has been successfully scheduled and you have been given an interactive cluster session on compe028. Notice how the prompt changes to [username@compe028] to show that you are now working on one of the cluster nodes rather than on rescomp1-2.

When you have finished with your interactive session, log out as normal by typing e.g. exit and you will return to rescomp1-2. If you wish to take a break from your interactive session then we recommend starting your interactive job from within a screen or tmux session and allowing screen or tmux to keep your interactive session alive.

The srun command accepts many of the same parameters and behaves similarly to the sbatch command described in detail below. In particular, please note that the queue constraints still hold i.e. a session started with 'srun -p short' runs on the 'short' partition and is subject to that partition's time limit of 30 hours.

SBATCH

The sbatch command is used to submit scheduled non-interactive jobs.

SUBMITTING JOBS USING SBATCH - STEP BY STEP GUIDE FOR NEW USERS

First log on to cluster1.bmrc.ox.ac.uk following the login guide and ensure that you are in your home directory by running:

$ cd ~

In practice, the easiest and most flexible method to submit a cluster job is to prepare and then submit a bash script file. This bash script file can contain configuration settings for the scheduler, can load pre-installed software, can set any environment variables that your software needs, and then can launch your main software. A bash script is also useful when doing repetitive work such as running the same analysis several times with different input files (see array jobs below).

We begin with a simple example. Once logged into cluster1.bmrc.ox.ac.uk make a subdirectory in your home folder called sbatch-test and change directory into it:

$ mkdir sbatch-test
$ cd sbatch-test

Now copy the contents below into a file named hello.sh in your sbatch-test folder.

#!/bin/bash

echo "------------------------------------------------"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"

sleep 60s
echo "Hello, world!"

This script works as follows.

The first line #!/bin/bash is a special command which says that this script should use the bash shell. You will normally want to include this in every bash script file you write.
In the next lines (all beginning echo...) we print some useful debugging information. This is included here just for information.
The line sleep 60s causes our script to pause for 60 seconds.
The final line prints the message ‘Hello, world!’

Once your script is saved you can submit it using:

$ sbatch -p short hello.sh
Submitted batch job 84481

This will submit the example script (hello.sh) located in your current working directory. When you submit this job, the scheduler will report back that it has received your job and assigned it an ID number - see below for why this number is very important.

To run even such a simple task as this on the cluster involves a number of important steps that it will be helpful to understand. These are now discussed below.

Job ID

Note first that every job submitted to the scheduler receives a JOB ID number. In the example above, the job id number was 84481 but your job id number will be different (indeed, because the BMRC cluster handles so many jobs, it is not unusual to have a job id number in the millions).

The ID number of your computing job is the most important piece of information about your job. When you need to contact the BMRC team for help, please remember to tell us what your job id was. This will make it considerably easier for us to help you.
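
Because the job ID is needed so often, it can be convenient to capture it in a shell variable at submission time. The following is a minimal sketch, assuming the submission message has the form shown above. The variable names are illustrative, and the sample text stands in for real sbatch output so that the snippet runs anywhere.

```shell
#!/bin/bash
# Sample text copied from the example above. On the cluster you could instead
# capture the real message with: submit_output=$(sbatch -p short hello.sh)
submit_output="Submitted batch job 84481"

# The job ID is the last space-separated word of the message.
job_id="${submit_output##* }"

echo "Job ID: ${job_id}"
```

sbatch also accepts a --parsable option, which prints the job ID without the surrounding text and so avoids the string handling.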

Log files

When you run a cluster job, by default the scheduler produces one file, which it names in the format slurm-<jobid>.out. This contains both the output and the errors.

If your job (“hello.sh”) has already finished, you will be able to see this file in your current directory by running:

$ ls
hello.sh slurm-84481.out

NB If running the ls command does not show the log file, wait a few more seconds and try again.

It is possible to specify that the output and the errors be separated into two different files as discussed later. Usually, the output file ends in .out and the error file in .err. The significance of the two files derives from the Linux convention that running a command or a script in the terminal produces two output streams, called the Standard Output (stdout) and Standard Error (stderr). (It can of course also send output directly to files on disk). Commands or scripts send their normal output to stdout while error messages go to stderr. When you run a command directly at the terminal, both streams are normally sent back to the terminal so you see them both on screen together. However, when running as a cluster job both streams are redirected to a file as happens here.
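
The two streams can be demonstrated locally without the scheduler. This sketch uses a hypothetical function, report, and temporary files whose names are chosen only for illustration. It first sends each stream to its own file and then merges both into a single file, which mirrors the default single log file described above.

```shell
#!/bin/bash
# Hypothetical helper that writes one line to each stream.
report() {
    echo "normal output"        # goes to stdout
    echo "error message" >&2    # goes to stderr
}

tmpdir=$(mktemp -d)

# Send each stream to its own file, as with separate output and error logs.
report > "$tmpdir/demo.out" 2> "$tmpdir/demo.err"

# Send both streams to one file, as with the default single log file.
report > "$tmpdir/combined.out" 2>&1

echo "Files written to $tmpdir"
```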

To see the contents of the log file, you can use the cat command. (For longer log files, the less command is probably more useful).

SBATCH parameters and environment

As we have seen above, the scheduler software can be used very simply. However, it is also highly adaptable by passing certain configuration parameters. We will look at some of those parameters in more detail below. For now, we will focus on one important parameter - the environment.

When working in the shell in Linux, your commands are run in a particular context called the environment, which contains certain variables and their values. Your environment contains, for example, the PWD variable which stores which directory you are currently working in. You can see this by running echo $PWD. When you run a command directly at the shell, your command inherits your current environment; however, submitting jobs to the scheduler is different. These jobs will not run in your current shell and, indeed, they will not even run on the same computer you are currently logged into. What value, then, should the scheduler use for your current working directory (i.e. the PWD variable)?

The Slurm scheduler propagates the current working directory of the shell from which the script was submitted to the script unless it is explicitly told to do otherwise. It is possible to tell Slurm what the working directory should be with the -D or --chdir option.

Changing the behaviour of the sbatch command can be achieved in three different ways:

Extra parameters can be added directly on the command line. For example, to tell the scheduler to use the slurm_test directory in your well directory as the working directory, specify -D /well/project/users/username/slurm_test, replacing project and username with your own project and username.

$ sbatch -D /well/project/users/username/slurm_test hello.sh

Extra parameters can be added to your script files using the special script syntax #SBATCH. For example, instead of adding -D to the command line you could add it to your bash script file like this:

...

#SBATCH -D /well/project/users/username/slurm_test
...

Environment variables beginning with SBATCH_ or SLURM_ can be set in the .bashrc file in your home directory, which is sourced every time you start a shell. This is probably best used for configuration settings that you won’t change very often. For example, to tell Slurm to run your jobs within a particular project you would add lines like:

...
export SBATCH_ACCOUNT=project.prj
...

IMPORTANT - avoid inheriting environments

By default the Slurm scheduler makes a copy of the user's shell environment, including all environment variables, at the time of submission and adds these to the environment when your scheduled job is run on a cluster node.

You can force the scheduler not to propagate any environment variable except those beginning with SLURM_ or SBATCH_ with the --export=NONE parameter. Unfortunately, this will give you a completely clean environment without any other environment variables including, for example, PATH, which is used to specify the default paths where the shell searches for commands and executables.

In particular, please note that the default behaviour (propagating the submission environment) interferes with the automatic selection of software and modules as discussed in our module guide. If you are submitting to an Ivybridge node, such as the himem nodes, it is best to submit from rescomp3, which is itself an Ivybridge node. This is because the modules for Skylake-compatible nodes and for nodes that are only Ivybridge-compatible are different, and Slurm will inherit the module settings from the submission node. It is also good practice to submit from an environment with no modules or other extra environments, such as Python venvs, loaded, and to load these within your script for reproducibility. Consider putting module purge at the top of your script, before loading the required modules, to unload any modules inherited from the submission node.
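
The difference between an inherited and a clean environment can be seen locally. This is only an analogy and does not involve Slurm: the variable MY_SETTING is hypothetical, and env -i stands in for the effect of --export=NONE by starting a child process with an empty environment.

```shell
#!/bin/bash
# Hypothetical variable set in the submitting shell.
export MY_SETTING="from-login-shell"

# A child process started normally inherits the variable, much as a job does
# under the default Slurm behaviour.
inherited=$(/bin/bash -c 'echo "${MY_SETTING:-unset}"')

# env -i starts the child with an empty environment, a rough local analogy
# for submitting with --export=NONE.
clean=$(env -i /bin/bash -c 'echo "${MY_SETTING:-unset}"')

echo "inherited: ${inherited}"
echo "clean:     ${clean}"
```

Setting everything the job needs inside the job script itself means the result does not depend on which of these two situations applies.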

Specifying parameters in your batch script file

The sbatch command can be used to run jobs very simply. However, it also accepts a wide range of configuration settings. For example, the option -p specifies which partition to submit to, so we could resubmit hello.sh to a different partition as follows:

$ sbatch -p long hello.sh

Note that, although we are resubmitting the same script hello.sh to the cluster, this still represents a new job so the scheduler will issue a new job id.

Typing all the desired options for a job at the command line is certainly possible, but it also has disadvantages: not only does it require more typing, it makes it more difficult to record which options you previously used. So, we recommend instead that you include configuration parameters directly in your script file.

Using your favourite editor, let’s update the hello.sh file as follows by adding the directives to the top of the file below the line #!/bin/bash.

#!/bin/bash

#SBATCH -A project.prj
#SBATCH -J my-job
#SBATCH -o my-job-%j.out
#SBATCH -e my-job-%j.err
#SBATCH -p short

echo "------------------------------------------------"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"

sleep 60s
echo "Hello, world!"

These five lines use the special #SBATCH comment syntax to add some parameters to our sbatch command. In particular:

We specify our project account name using -A. Make sure to substitute your own project name, which is typically the same as your primary group name with .prj on the end. Use id -gn to show your primary group name.
We use -J to specify a name for this job. By default, jobs are named after their script files, but you can specify any name that would be helpful to you.
We use -o my-job-%j.out to specify an output file with the job-id in its name represented by %j.
We also use -e my-job-%j.err to specify an error file.
Finally, we use -p short to specify that the job should run on the short partition.

Now that we have specified the project account and partition within the script we can submit the script file more simply:

$ sbatch hello.sh
Submitted batch job 84482

Now that your job has been submitted, if you are quick enough (within 60 seconds!) you can use the squeue command to check the status of your job. The output will look something like the below.

$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

84482 short my-job <username> R 0:01 1 compe028

See below for more information on the squeue command. For now, simply note that with the -u argument, squeue will show you the status of all your queued or running jobs. If you have several queued or running jobs and just want to see the status of one of them, you can use squeue -j <job_id>.

Because Slurm uses the directory from which the job was submitted as the working directory unless told otherwise, and we submitted from $HOME/sbatch-test, that is where it will put our log files. Now let’s look at the log files:

$ ls
hello.sh my-job-84482.out my-job-84482.err
$ cat my-job-84482.out
------------------------------------------------
Run on host: compe028.hpc.in.bmrc.ox.ac.uk
Operating system: Linux
Username: <your-username>
Started at: Mon May 30 17:18:07 BST 2022
------------------------------------------------
Hello, world!

The output shows that this script ran on the node compe028.hpc.in.bmrc.ox.ac.uk, although you will likely see a different node. It also shows that the cluster node’s operating system was Linux, that the job was run as your username, and that the job started at the time shown.

Since this job ran successfully, the .err file will be empty, which is a reassuring sign! You can check that it’s empty, if you like, using the cat command.

Note that we explicitly specified a partition for this job to be submitted to. If you don't specify a partition, jobs will run by default on the short partition, which has a maximum run time of 30 hours. Although this is the default, we recommend always specifying a partition.

Instead of writing your own job submission script from scratch you can also use a copy of our job submission template file, located at /apps/scripts/slurm.template.sh, which contains hints on a number of commonly used parameters. To use this file, begin by making a copy into your own home directory by running:

cp /apps/scripts/slurm.template.sh ~/myjobscript.slurm.sh

Then you can edit the file for your own purposes.

Slurm partitions

When submitting a job to Slurm, you submit to a partition. A partition is a group of compute nodes with defined default, min and max settings for various resources that the job might request. The following partitions are available, shown with their settings for the three most commonly requested resources: maximum job runtime, default number of CPUs and default memory per requested CPU:

Partition	Time limit	Default #CPUs	Default mem per CPU
short	30 hours	1	15.2 GB
long	10 days	1	15.2 GB
relion	10 days	1	15.2 GB
epyc	10 days	1	15.2 GB
fraser	10 days	1	15.2 GB
brcgel	10 days	1	7.5 GB
win	10 days	1	26.2 GB
himem	10 days	1	31.25 GB

For the partitions that run on nodes that contain gpus, please see the lists on the GPU Resources page.

Note that requesting specific hostgroups within a partition can be achieved with constraints as shown below.

Information about the Slurm partitions can be shown with the sinfo command:

$ sinfo

Submitting to specific host groups or nodes

Sometimes, you may wish to target which cluster nodes your job will run on more finely than by specifying just a partition. In this case, it is possible to add further information to your partition specification.

Specifying sbatch --constraint="hga" will run only on A nodes, while sbatch --constraint="skl-compat" will run only on the nodes with Skylake or later CPU microarchitecture. The normal list of "hostgroups" that you can use for these purposes is:
- intel - is an intel node
- ivybridge - is an ivybridge node
- ivy - alias for ivybridge
- ivy-compat - is capable of running code compiled for ivybridge
- skylake - is a skylake node
- skl - alias for skylake
- skl-compat - is capable of running code compiled for skylake
- cascadelake - is a cascadelake node
- csl - alias for cascadelake
- csl-compat - is capable of running code compiled for cascadelake
- icelake - is an icelake node
- icl - alias for icelake
- icl-compat - is capable of running code compiled for icelake
- epyc - is an epyc node
- amd - is an AMD node
- rome - is a rome node
- rme - alias for rome
- rme-compat - is capable of running code compiled for rome
- hostgroupe - e nodes
- hge - alias for hostgroupe
- hostgroupf - f nodes
- hgf - alias for hostgroupf
- hostgroupa - a nodes
- hga - alias for hostgroupa
- hostgrouph - h nodes
- hgh - alias for hostgrouph
- hostgroupw - win
- hgw - alias for hostgroupw

It is rarely desirable to limit running your job to a particular node, but in cases where this is desirable, you can use the ‘-w’ or ‘--nodelist’ specification similarly to specify a particular target node or list of nodes.

Specifying sbatch -w compd024 will run only on node compd024.

Checking or deleting running jobs

A busy computing cluster handles many thousands of computing jobs every day. For this reason, when you submit a computing job to the cluster your job may not run immediately. Instead, it will be held in its queue until the scheduler decides to send it to a cluster node for execution. This is one important difference to what happens when you run a job on your own computer.

You can check the status of your job in the queue at any time using the squeue command. Running squeue on its own will report the status of every job on the cluster which remains in the queue. Use

squeue -u <username>

to show only your jobs.

Alternatively, use

squeue -j <jobid>

to report the status of an individual job.

The output of these commands will look like this:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

84497 short my-job <user> R 0:09 1 compe028

squeue reports lots of useful information such as job id, the time the job has been running for, and the nodes the job is running on or the reason it is still pending. However, perhaps the most important piece of information is in the ST i.e. state column. For example, PD means that the job is held in the queue and waiting to start. Alternatively, a state of R means that the job is currently running.
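
As a quick reference for the ST column, the sketch below maps a few common state codes to descriptions. The function describe_state is a hypothetical helper written for this guide and covers only a handful of the codes Slurm can report.

```shell
#!/bin/bash
# describe_state: translate a squeue ST code into a short description.
# Hypothetical helper; only a few common codes are covered.
describe_state() {
    case "$1" in
        PD) echo "pending: waiting in the queue" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        F)  echo "failed" ;;
        CA) echo "cancelled" ;;
        TO) echo "timed out" ;;
        *)  echo "unknown state: $1" ;;
    esac
}

describe_state PD
describe_state R
```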

To list the jobs currently running for a user use

squeue -u <username> -t RUNNING

To list the jobs currently pending for a user use:

squeue -u <username> -t PENDING

To list a user's jobs in a particular partition use

squeue -u <username> -p <partition>

To show detailed information about a job in the queue use

scontrol show job <jobid>

The -dd option will cause the command to show additional details. If <jobid> is not specified then scontrol will show statistics for all jobs.

One important thing to note is that the squeue and scontrol commands report only jobs which are either waiting to run or currently running. Once a job has finished squeue will no longer report it. Instead, once a job has finished you can see a detailed audit, including total run time and memory usage, by running

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

Without the --format option sacct will show a default set of basic information. Additional fields can be added or removed from the list above. A full list of those available can be seen with

sacct --helpformat

To see the above information for all jobs for a user use

sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed

There is also a command to show a status report for running jobs only

sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

This is limited to your own jobs. The available fields are different to sacct and can be displayed with the --helpformat option.

Occasionally, you may wish to delete a job before it starts or before it completes. For this you can use the scancel command:

scancel <jobid>

A template bash script is available at /apps/scripts/slurm.template.sh which you can copy and modify for your own needs.

Preparing a bash script file

We’ll start by using the bash script template located at /apps/scripts/slurm.template.sh. You should take a copy of this file to use as a starting point for your own scripts. The file will begin with the following content:

#!/bin/bash

# Specify a job name
#SBATCH -J jobname

# Account name and target partition
#SBATCH -A group.prj
#SBATCH -p short

# Log locations which are relative to the current
# working directory of the submission
#SBATCH -o output.out
#SBATCH -e error.err

# Parallel environment settings
# For more information on these please see the documentation
# Allowed parameters:
# -c, --cpus-per-task
# -N, --nodes
# -n, --ntasks
#SBATCH -c 1

# Some useful data about the job to help with debugging
echo "------------------------------------------------"
echo "Slurm Job ID: $SLURM_JOB_ID"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"

# Begin writing your script here

echo $SLURM_JOB_ID

# End of job script

The first line #!/bin/bash specifies that this is a bash script (this will always be the same).

Any line beginning with # is a bash comment i.e. it will be ignored by the bash interpreter when the script file is actually run. Comments are most often there to make the code more human readable, i.e. to explain what the code does to any human readers. However, our template file also contains a number of special comments which begin #SBATCH. These are special comments which provide configuration settings which will be read and understood by the scheduler.

For example, it is often helpful to specify a name for your job - you can do this by changing the line #SBATCH -J jobname. Some of the configuration settings in the template file are optional but others are required. For example, submitting a job to the cluster requires that you specify an account to associate the job with. This is used for accounting and also defines which of your affiliations the work is attributed to. Each group that can use the cluster will be given an account, often associated with the PI of the group. Although each user has a default account we recommend always specifying the account in the job script.

Including configuration settings in your bash script has the benefit of saving typing if you ever want to run the same or a similar job again and helps to keep a record of which settings were used. If you have some scheduler configuration settings in your bash script, you can temporarily override them by specifying a different parameter at the command line when you submit the job.

The penultimate section of the bash script template uses the echo command to output some information which can be useful in debugging.

Finally, the script runs the actual computing job. In this case, the job is a trivial one of printing out the job id, which is achieved by the line echo $SLURM_JOB_ID. However, it is possible to achieve very complex things in your bash script file including loading specific software modules, looping over a set of data or running a whole pipeline in sequence.

Commonly used sbatch parameters

The larger the amount of resource (e.g. runtime, number of cpus and/or memory) requested for the job, the longer it is likely to take for your job to be scheduled. We recommend keeping to 12 cores or fewer and to the default memory-per-core or less unless you have a particular need for more. The current maximums are 48 cores (A nodes in the long and short partitions, H nodes in the himem partition) and 2028 GB total memory per job (on the himem partition). We recommend requesting no more than you need, as this will aid scheduling and so maximise throughput. As an extreme example, if you have a choice between running a single 48 core, 729.6 GB job or segmenting it into 48 single-core, 15.2 GB jobs, the latter are far more likely to get scheduled and hence finish soonest. Similarly for job runtime, if you know your job will definitely complete in four hours, set a four hour time limit. This will make it easier for Slurm to allocate the resources the job requires.
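
The memory figure in the example above follows from multiplying the number of cores by the default memory per CPU. A short sketch of that arithmetic, using the 15.2 GB default of the short and long partitions (the variable names are illustrative):

```shell
#!/bin/bash
# Default memory for a job = number of cores x default memory per CPU.
cores=48
mem_per_cpu_gb=15.2

# bash arithmetic is integer-only, so use awk for the decimal multiplication.
total_gb=$(LC_ALL=C awk -v c="$cores" -v m="$mem_per_cpu_gb" 'BEGIN { printf "%.1f", c * m }')

echo "${cores} cores x ${mem_per_cpu_gb} GB = ${total_gb} GB"
```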

Run man sbatch to see the full list of sbatch options. However, here are some commonly used options as a guide:

-A <project>.prj	Specify which account your job belongs to
-p <partitionname>	Specify which partition to run a job in. See the full list of cluster partitions.
-o <path/to/file> No -e argument set	Instead of separate output and error log files, send the error output to the regular output log file
-o <path/to/file.o> -e <path/to/file.e>	Put the output log file (-o) and/or error log file (-e) into the specified files.
-D </path/to/dir>	Use /path/to/dir as the working directory when executing your script
-t T --time=T	Change the time limit on the job from the default configured on the partition. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". If you ask for more than the maximum time limit configured on the partition then your job will be queued but never get scheduled.
-c N --cpus-per-task=N	Request N CPU cores. If there is no specific request for memory, the memory allocated will be the default for the partition multiplied by the number of cores - see the list of partitions for the defaults.
--mem-per-cpu=M	Request M memory per cpu, rather than using the default configured on the partition. Incompatible with '--mem'. If you don't know the memory requirements of your job then we recommend just leaving it to the defaults.
--mem=M	Request M memory per job, rather than using the number of cpus multiplied by the default memory per cpu configured on the partition. Incompatible with '--mem-per-cpu'. If you don't know the memory requirements of your job then we recommend just leaving it to the defaults.
-a indexes --array=indexes	Submit a job array, multiple jobs to be executed with identical parameters. See the section ARRAY JOBS below for more information, including how to specify the indexes parameter.
--wrap="<executable>"	Wrap a binary automatically in a simple shell script so that it can be submitted with sbatch directly
ARGUMENTS YOU PROBABLY DON'T WANT TO USE
-N N --nodes=N	Request N nodes be allocated to the job. You probably don't want to use this argument, as most applications run on the BMRC Cluster do not scale beyond a single node, and those that do by using MPI should be started using srun instead. There are some clever things that can be done - but you almost certainly want the '-c' argument instead.
-n N --ntasks=N	Request that Slurm allocate resources for N tasks and run the submitted script. Slurm expects the script to know how to use the allocated resources and not that N instances of the script get run. Again, possibilities for clever things, but you almost certainly mean to submit an array job and should use the '-a' argument instead.
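
Several of the options above can be combined in one job script. The sketch below is an assumption-laden example rather than a recommended configuration: the account name, core count and time limit are placeholders, and my_program is hypothetical. It reads SLURM_CPUS_PER_TASK, which Slurm sets when -c is given, and falls back to 1 so that the script also runs outside the scheduler.

```shell
#!/bin/bash
#SBATCH -A project.prj
#SBATCH -p short
#SBATCH -c 4
#SBATCH --time=04:00:00

# Slurm sets SLURM_CPUS_PER_TASK when -c is given; fall back to 1 so the
# script also runs outside the scheduler for testing.
threads="${SLURM_CPUS_PER_TASK:-1}"

echo "Using ${threads} thread(s)"

# A multi-threaded program would be launched here, for example (hypothetical):
# my_program --threads "${threads}"
```

Passing the allocated core count to the program helps avoid starting more threads than the job has cores.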

Array jobs

So far, our examples have involved running just a single, isolated job. However, a very common workflow in scientific computing is to want to run the same computational process with different input data and produce different output data. For example, you might have some code which analyses some source data (genetic data, population data, whatever it may be) and then generates an output - and you want to run this same code multiple times on different sets of input data to produce multiple sets of output data. In cases like these, we recommend using an array job.

An array job is suitable when you wish to run the same code multiple times and where each run of the code is independent of the others i.e. each run of the code uses its own input source and does not depend on the output of any other run of the code.

An array job:

is made up of a number of tasks, with each task having a specific task id
has a single submission script that gets executed for each task in the array
has a single job id

The tasks get scheduled separately such that many tasks can be executing at the same time. Individual tasks in an array job must not depend upon each other as there is no guarantee any one task will have completed before another starts. There are a number of benefits to choosing to use array jobs rather than submitting lots of nearly identical jobs. As the entire array is just a single job, it is easier for you to keep track of and manage your jobs. Also, array jobs put much less load on the job scheduling system than the equivalent number of single jobs.

Declaring an array job

Array jobs are declared using the ‘--array’ argument to sbatch and specifying parameters for the array:

--array n[-m[:s]]

n is the first task id;
m is the last task id;
s is the task id step size (default is 1).

If only n is specified, only one task with ID n is launched.

Task IDs can also be comma separated. For example, "--array=0,6,16-32".

Examples

sbatch --array 1-10 myscript.sh - run myscript.sh with tasks numbered 1-10
sbatch --array 2-10:2 myscript.sh - run myscript.sh with tasks numbered 2,4,6,8,10
sbatch --array 23 myscript.sh - run myscript.sh with a single task numbered 23 (useful if you need to re-run an individual task)
sbatch --array 2,23,48,51 myscript.sh - run myscript.sh with tasks numbered 2,23,48,51 (useful if you need to re-run a particular subset of tasks)
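
To check which task IDs a specification will produce before submitting, the expansion can be reproduced locally. The function expand_array_spec below is a hypothetical helper, not part of Slurm, and is a sketch that assumes well-formed input in the forms shown above (single IDs, ranges, ranges with a step, and comma-separated combinations).

```shell
#!/bin/bash
# expand_array_spec: print the task IDs implied by an --array specification.
# Hypothetical helper for illustration; Slurm does this internally.
expand_array_spec() {
    # Operator %% with pattern "%*": drop everything from the first "%" onwards.
    local spec="${1%%%*}"
    local part first last step
    local IFS=','
    for part in $spec; do
        step=1
        if [[ "$part" == *:* ]]; then
            step="${part##*:}"
            part="${part%%:*}"
        fi
        first="${part%%-*}"
        last="${part##*-}"    # equals first when there is no "-"
        seq "$first" "$step" "$last"
    done | tr '\n' ' ' | sed 's/ $//'
}

expand_array_spec 2-10:2
echo
expand_array_spec 2,23,48,51
echo
```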

The maximum number of simultaneously running tasks from the job array can be specified with the "%" separator. For example, specifying "--array=0-999%100" would limit the number of tasks from the job array running simultaneously to 100.

The following restrictions apply to the values n and m:

0 <= n < MIN(4000001, MaxArraySize)
0 <= m < MIN(4000001, MaxArraySize)
n <= m

where MaxArraySize is defined in the cluster configuration (400001 at the time of writing).

The total number of tasks in the array must be less than MIN(MaxArraySize, max_array_tasks). At the time of writing max_array_tasks is set to 75001. If max_array_tasks is insufficient then please contact bmrc-help@medsci.ox.ac.uk to ask for it to be raised.

Environment variables for array jobs

The purpose of an array job is to allow the same code (a Python or R script, for example) to run multiple times with different inputs with each run producing different outputs. However, the Python or R script that you submit will be the same for all tasks in the array job - each task will run the very same script. Naturally, this raises the question of how this very same script will be able to read data from different inputs and write data to different outputs. There are two parts to the answer. First, the scheduler will ensure that each task in the array job will run in an environment with different environment variables. The full list of environment variables for array jobs is shown below, but perhaps the most important is SLURM_ARRAY_TASK_ID. In each task, the scheduler will ensure that this variable will be set to a particular task number, i.e. in task 1, the value of SLURM_ARRAY_TASK_ID will be 1, while in task 2, it will be 2. So, although your script is the same for each task, the environment is different. The second part of the answer is that your script must then use these differing environment variables to dynamically decide which input files to read and where to write its outputs.

One approach to making your scripts dynamic would be to make sure that your input files are named in a convenient way. For example, if your input files are labelled data1.txt, data2.txt, …, data9.txt, then you could submit an array job with tasks numbered 1-9 and use SLURM_ARRAY_TASK_ID to dynamically set an input file name and output file name. Alternatively, you might have a manifest file, i.e. a text file containing a list of filenames, one per line. Then your script could use SLURM_ARRAY_TASK_ID to read the filename on the corresponding line for its input. In general, you have the full power of whichever programming language you are using (Python, R, Julia, etc.) to make your code respond dynamically to the differing value of SLURM_ARRAY_TASK_ID.
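As a minimal sketch of the file-naming approach, the snippet below derives an input and an output file name from the task id. The helper names input_for_task and output_for_task and the resultN.txt naming are assumptions made for illustration only; the fallback to task 1 is there purely so that the snippet can be tried outside an array job.

```shell
#!/bin/bash
# Hypothetical layout: inputs named data1.txt ... data9.txt, one result file per task.
# Outside an array job SLURM_ARRAY_TASK_ID is unset, so fall back to 1 for a dry run.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Illustrative helpers (not part of Slurm): map a task id to file names.
input_for_task()  { echo "data${1}.txt"; }
output_for_task() { echo "result${1}.txt"; }

input_file=$(input_for_task "${task_id}")
output_file=$(output_for_task "${task_id}")

echo "Task ${task_id}: would read ${input_file} and write ${output_file}"
## For example:
## Rscript /path/to/my/rscript.R "${input_file}" "${output_file}"
```

Submitted with --array 1-9, each task would then pick up its own dataN.txt without any change to the script itself.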

Slurm sets the following task-related environment variables:

SLURM_ARRAY_JOB_ID - job array's master job ID number

SLURM_ARRAY_TASK_COUNT - total number of tasks in a job array

SLURM_ARRAY_TASK_ID - the task ID of this task

SLURM_ARRAY_TASK_MAX - the task id of the last task in the array (note that this could be, but is not necessarily, the parameter 'm' given to '--array' depending on 'n' and 's')

SLURM_ARRAY_TASK_MIN - the task id of the first task in the array (parameter ‘n’ given to the ‘--array’ argument)

SLURM_ARRAY_TASK_STEP - the size of the step in task id from one task to the next (parameter ‘s’ given to the ‘--array’ argument)
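To see how these values relate to the n, m and s given to --array, the following sketch lists the task ids that a given range and step produce, assuming tasks are numbered n, n+s, n+2s and so on without exceeding m. The function name array_task_ids is a hypothetical helper, not a Slurm command.

```shell
#!/bin/bash
# array_task_ids N M [S]: print the task ids generated by "--array N-M:S", one per line.
# Illustrative helper only; S defaults to 1.
array_task_ids() {
    local n="$1" m="$2" s="${3:-1}" id
    id="$n"
    while [ "$id" -le "$m" ]; do
        echo "$id"
        id=$(( id + s ))
    done
}

# Example: the ids produced by --array 1-527:60
ids=$(array_task_ids 1 527 60)
echo "task ids:   $(echo ${ids})"
echo "task count: $(echo "${ids}" | wc -l | tr -d ' ')"
echo "last id:    $(echo "${ids}" | tail -n 1)"
```

For 1-527:60 this lists nine task ids, the last of which is 481 rather than 527, which is consistent with the note above that SLURM_ARRAY_TASK_MAX is not necessarily equal to m.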

Simple array job example

#!/bin/bash

# This script sets up a task array with a step size of one.

#SBATCH -J TestSimpleArrayJob
#SBATCH -p short
#SBATCH --array 1-527:1
#SBATCH --requeue

echo `date`: Executing task ${SLURM_ARRAY_TASK_ID} of job ${SLURM_ARRAY_JOB_ID} on `hostname` as user ${USER}
echo SLURM_ARRAY_TASK_MIN=${SLURM_ARRAY_TASK_MIN}, SLURM_ARRAY_TASK_MAX=${SLURM_ARRAY_TASK_MAX}, SLURM_ARRAY_TASK_STEP=${SLURM_ARRAY_TASK_STEP}

##########################################################################################
#
# Do any one-off set up here
#
## For example, set up the environment for R/3.1.3-openblas-0.2.14-omp-gcc4.7.2:
## . /etc/profile.d/modules.sh
## module load R/3.1.3-openblas-0.2.14-omp-gcc4.7.2
## which Rscript
#
##########################################################################################

##########################################################################################
#
# Do your per-task processing here
#
## For example, run an R script that uses the task id directly:
## Rscript /path/to/my/rscript.R ${SLURM_ARRAY_TASK_ID}
## rv=$?
#
##########################################################################################

echo `date`: task complete
exit $rv

Array job example using step size

#!/bin/bash

# This script sets up a task array with one task per operation and uses the step size
# to control how many operations are performed per script run, e.g. to manage the
# turnover time of the tasks. This also makes it a bit easier to re-run a specific
# task than using a step size of one and an unrelated loop counter inside the script

#SBATCH -J TestArrayJobWithStep
#SBATCH -p short
#SBATCH --array 1-527:60
#SBATCH --requeue

echo `date`: Executing task ${SLURM_ARRAY_TASK_ID} of job ${SLURM_ARRAY_JOB_ID} on `hostname` as user ${USER}
echo SLURM_ARRAY_TASK_MIN=${SLURM_ARRAY_TASK_MIN}, SLURM_ARRAY_TASK_MAX=${SLURM_ARRAY_TASK_MAX}, SLURM_ARRAY_TASK_STEP=${SLURM_ARRAY_TASK_STEP}

##########################################################################################
#
# Do any one-off set up here
#
## For example, set up the environment for R/3.1.3-openblas-0.2.14-omp-gcc4.7.2:
## . /etc/profile.d/modules.sh
## module load R/3.1.3-openblas-0.2.14-omp-gcc4.7.2
## which Rscript
#
##########################################################################################

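# Calculate the last task id for this step.
# Note: SLURM_ARRAY_TASK_MAX is the id of the last task in the array, which may be
# lower than the upper bound given to --array, so check that the final step
# covers all of the operations you intend to run.
this_step_last=$(( SLURM_ARRAY_TASK_ID + SLURM_ARRAY_TASK_STEP - 1 ))
if [ "${SLURM_ARRAY_TASK_MAX}" -lt "${this_step_last}" ]
then
    this_step_last="${SLURM_ARRAY_TASK_MAX}"
fi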

# Loop over task ids in this step
while [ "${SLURM_ARRAY_TASK_ID}" -le "${this_step_last}" ]
do
    echo `date`: starting work on SLURM_ARRAY_TASK_ID=`printenv SLURM_ARRAY_TASK_ID`

##########################################################################################
#
#   Do your per-task processing here
#
## For example, run an R script that uses the task id directly:
## Rscript /path/to/my/rscript.R ${SLURM_ARRAY_TASK_ID}
## rv=$?
#
##########################################################################################

    # Increment SLURM_ARRAY_TASK_ID
    export SLURM_ARRAY_TASK_ID=$(( SLURM_ARRAY_TASK_ID + 1 ))
done

echo `date`: task complete
exit $rv

Mapping task-specific arguments from task ids

The examples above just use the task id directly but this isn’t always possible. There are a number of ways that one might be able to map the task id to the task-specific arguments but if your requirements are complicated then you may have to write some code to do it. However, there are some simple options.

If you have a file that lists the task-specific arguments, with the arguments for the nth task on the nth line:

cat /path/to/my/argument/list | tail -n+${SLURM_ARRAY_TASK_ID} | head -1 | xargs /path/to/my/command
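An equivalent lookup can be written with sed, which makes it easier to check that a line was actually found before running the command. This is only a sketch: args_for_task is a hypothetical helper, and the sample list contents are made up for the demonstration.

```shell
#!/bin/bash
# args_for_task LIST_FILE TASK_ID: print line TASK_ID of LIST_FILE
# (prints nothing if the file has fewer lines). Illustrative helper only.
args_for_task() {
    sed -n "${2}p" "$1"
}

# Small self-contained demonstration with a temporary, made-up argument list.
list=$(mktemp)
printf '%s\n' "sampleA.txt 10" "sampleB.txt 20" "sampleC.txt 30" > "${list}"
echo "task 2 arguments: $(args_for_task "${list}" 2)"
rm -f "${list}"

## Inside an array job script this might be used as:
## args=$(args_for_task /path/to/my/argument/list "${SLURM_ARRAY_TASK_ID}")
## if [ -z "${args}" ]; then echo "no arguments for this task" >&2; exit 1; fi
## /path/to/my/command ${args}
```

Checking for an empty result guards against submitting more tasks than there are lines in the list.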

Job dependencies

Once your cluster workflows reach a certain level of complexity, it becomes natural to want to order them into stages and to require that later jobs commence only after earlier jobs have completed. The scheduler includes dedicated parameters to assist with creating pipelines to meet this requirement.

To ensure that job B will commence only after job A has finished successfully, you first submit job A and make a note of its job id number. Now you can submit job B and specify that it should be held until job A finishes and returns an exit code of 0:

sbatch -p short -d afterok:<jobA_id> jobB.sh

If wishing to make such hold requests in a shell script, it may help to submit job A using the --parsable parameter so that (in the case of our cluster) only its job id number is returned. You can then collect this id number in a shell variable and use it when submitting job B like so:

JOBA_ID=$(sbatch --parsable -p short jobA.sh)

sbatch -p short -d afterok:$JOBA_ID jobB.sh

Note there are other possible state requirements than afterok and they can be strung together with either commas for all the dependencies to be met or question marks for any of them to be met. Only one separator can be used. Please see the sbatch man page for details.
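For pipelines with more than two stages, the same pattern can be wrapped in a small shell function. The sketch below is an assumption-laden illustration rather than a supported tool: submit_chain is a hypothetical helper, the script names in the usage comment are placeholders, and it relies only on the sbatch options already shown above (--parsable, -p and -d afterok).

```shell
#!/bin/bash
# submit_chain PARTITION SCRIPT...
# Submit each script so that it starts only after the previous one has
# finished with exit code 0. Illustrative helper only.
submit_chain() {
    local partition="$1"; shift
    local prev_id="" job_id script
    for script in "$@"; do
        if [ -z "${prev_id}" ]; then
            job_id=$(sbatch --parsable -p "${partition}" "${script}") || return 1
        else
            job_id=$(sbatch --parsable -p "${partition}" -d "afterok:${prev_id}" "${script}") || return 1
        fi
        # --parsable may append ";cluster" on multi-cluster setups; keep the id only.
        job_id="${job_id%%;*}"
        echo "Submitted ${script} as job ${job_id}"
        prev_id="${job_id}"
    done
}

## Example usage on a login node (script names are placeholders):
## submit_chain short stage1.sh stage2.sh stage3.sh
```

If any submission fails, the function stops and returns a non-zero status, so later stages are not submitted without their dependency.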

It is also possible to use an advanced form of job holding when you have sequential array jobs and you wish each task in the second job to start as soon as the corresponding task in the first job has completed. For example, if job A is an array job with tasks numbered 1-100 and job B is another array job also with tasks numbered 1-100, then one can arrange for job B's tasks to run as soon as the corresponding task in job A has completed by running:

JOBA_ID=$(sbatch --parsable -p short --array 1-100 jobA.sh)

sbatch -p short --array 1-100 -d aftercorr:$JOBA_ID jobB.sh

Using this method, job B's tasks will be allowed to run (subject to availability on the cluster) as soon as the corresponding task in jobA has completed. Note that since job A's tasks may have different run-time durations and so may complete in any order, job B's tasks may commence in any order.

Troubleshooting

Cluster jobs can go wrong for many different reasons. If one of your computing jobs hasn’t behaved as expected, here are some suggestions on how to diagnose the issue. To begin, we focus on jobs which have already run to completion.

Check the job report using sacct

Log in to the cluster and run: sacct -j <job-id>. This will print a summary report of how the job ran, which looks as follows:

$ sacct -j 84528
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
84528            my-job      short  <account>          1  COMPLETED      0:0
84528.batch       batch             <account>          1  COMPLETED      0:0

Now check the State column. If State is COMPLETED then your job ran successfully to completion (from a code execution point of view). Otherwise, look up the State in the table in the sacct man page. A couple of states worth noting are OUT_OF_MEMORY, which indicates that the job experienced an out of memory error, and FAILED, which indicates that the job terminated with a non-zero exit code, in which case you should check the ExitCode column.
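When a job has many steps or array tasks, it can help to filter the report down to the entries that did not complete. The sketch below assumes sacct is asked for pipe-separated output without a header (the -P and -n options) and the three fields JobID, State and ExitCode; report_problem_steps is a hypothetical helper name.

```shell
#!/bin/bash
# report_problem_steps: read "JobID|State|ExitCode" lines on standard input and
# print only those whose State is not COMPLETED. Illustrative helper only.
report_problem_steps() {
    awk -F'|' '$2 != "COMPLETED" { printf "%s: state=%s exit=%s\n", $1, $2, $3 }'
}

## On the cluster this might be used as:
## sacct -n -P -j <job-id> --format=JobID,State,ExitCode | report_problem_steps
```

No output means that every entry in the report was in the COMPLETED state.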

If sacct -j reports that your job ran to completion (i.e. if State is COMPLETED) then the next step is to examine the .out and .err log files.

UGE to Slurm conversion

What I want to do	What I did for UGE	What I do for Slurm
Submit to short queue	qsub -q short.qc job_script.sh	sbatch -p short job_script.sh
Submit to long queue	qsub -q long.qc job_script.sh	sbatch -p long job_script.sh
Submit 4-cpu job	qsub -q short.qc -pe shmem 4 job_script.sh	sbatch -p short --cpus-per-task 4 job_script.sh
Submit to E nodes	qsub -q short.qc@@short.hge job_script.sh	sbatch -p short --constraint=hge job_script.sh
Submit with 1 cpu, 20G memory	Not previously possible	sbatch -p short --mem-per-cpu=20G job_script.sh
Submit to skylake-compatible node	qsub -q short.qe,short.qa job_script.sh	sbatch -p short --constraint=skl-compat job_script.sh
Submit to brienne	qsub -q brienne.q@brienne job_script.sh	sbatch -p fraser -w brienne job_script.sh

Interactive job on short queue	qlogin -q short.qc	srun -p short --pty bash
10-cpu interactive short job	qlogin -q short.qc -pe shmem 10	srun -p short --cpus-per-task 10 --pty bash

Set submission arguments within my script	#$ -q short.qc #$ -pe shmem 6	#SBATCH -p short #SBATCH --cpus-per-task=6

Default partition is “short”, with a default memory per cpu of 15.2G and a runtime limit of 30 hours.
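As an illustration of setting submission arguments within a script, the directives from the last row of the table are written one per line at the top of the job script. The body below is only a sketch; it assumes that Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is requested and falls back to 1 when the script is run outside the scheduler.

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH --cpus-per-task=6

# Use the CPU count granted by the scheduler, or 1 when run outside Slurm.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(hostname) with ${threads} CPU(s)"
```

Options given on the sbatch command line take precedence over #SBATCH lines in the script, so the same script can still be submitted to a different partition when needed.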

For conversion between SLURM, LSF, PBS/Torque and SGE commands please refer to the table here.
