Using the BMRC Cluster
Introduction to Cluster Computing
The purpose of a shared computing cluster is to allow a large number of users to do useful computing work in a way that is typically faster, easier, better organised, more reliable and more cost-effective than running it on personal desktops or laptops. In basic terms, a computing cluster comprises some computers and shared disk storage, plus some special software to coordinate all the different computing work that different users wish to run.
A little terminology:
- Computers in a cluster are often called nodes.
- A computing cluster is designed to carry out computing jobs. A job involves running a piece of code (whether simple or complex) which eventually terminates and usually produces an output. For example, analysing some data in Python or R is a job; likewise, running a machine learning task which eventually terminates is also a job. Conceptually, all of the useful computing work that you can do on a cluster is organised into jobs.
The BMRC cluster comprises (similarly to other clusters):
- Two computers called login nodes (also gateway nodes). These are the normal entrypoint into the cluster - to use the cluster you start by logging into one of the login nodes.
- A number of worker nodes or cluster nodes, each of which is a separate computer which can run several computing jobs.
- A fast and robust shared disk space that is accessible (with appropriate permissions and security) to both the login nodes and the cluster nodes.
- A special piece of software called the scheduler which coordinates all the computing jobs that users wish to run.
- A wide range of pre-installed scientific software plus the means to add additional custom software.
Typical Cluster Workflow
To understand the BMRC cluster, it will help to visualise it with a simplified diagram and to describe a typical user workflow.

A typical workflow when using the cluster is as follows. First, you use your own computer to login to a login node. Then you prepare and submit one or more computing jobs to the scheduler. The scheduler then takes care of choosing which cluster node will run your job. In your job, you can use pre-installed software or you can provide your own. You can also use data stored on disk as inputs to your computing job. Finally, the output files produced by your jobs, together with some log files containing information about where and how your job ran, are also stored in your disk space. Later, you re-login to a login node and view the outputs of your computing jobs on disk.
As you have just seen, the BMRC cluster comprises some dedicated login nodes together with some dedicated cluster nodes. It is important to remember that the login nodes are reserved solely for the purpose of preparing and submitting computing jobs to the scheduler, which will then send them to a cluster node for execution. For this reason, please note that computing jobs MUST NOT be run directly on the login nodes themselves. Any significant jobs running on login nodes (i.e. without submitting them to the scheduler) are liable to be stopped by the BMRC team. This policy helps to ensure that the login nodes are always available to everyone for their intended use.
The BMRC Cluster in Numbers
Here are a few headline statistics regarding the BMRC cluster (as of January 2020).
- ~7000 CPU cores and ~50 GPUs. All computers run a version of the Linux operating system.
- 7PB (7 million Gigabytes) of fast, shared storage serving data at up to 30GB/s
- The scheduling software used by BMRC is Univa Grid Engine. It currently manages ~ 24 million computing jobs per year.
Overview of Submitting a Job to the Cluster
A shared computing cluster is used by many different users to run many different types of computing job. One of the functions of a computing cluster and the scheduler is to handle this variation. In particular:
- The BMRC cluster is made of groups of different computers which vary by:
- CPU architecture, number of CPUs and CPU cores
- RAM type and quantity
- whether they include GPU processing.
- The computing jobs which users wish to run vary by:
- how long they will take to run
- how many CPU cores and how much RAM they will use
- whether they will require a GPU
- whether an individual job can be divided into subjobs which can run in parallel
Consequently, it is the job of the scheduler to distribute many different kinds of job among many different types of cluster node in a way that is fair and makes efficient use of the available compute resources. To assist in this task, the scheduler runs a number of different queues. Each queue provides a different set of resources from the other queues, and users are required, when submitting a job, to choose which queue is most suitable. The full details of the BMRC cluster queues are shown below.
Choosing the best queue is sometimes a process of trial and error, until you have a better idea of the requirements of your computing job. In most cases, the single most significant variable between queues is the maximum time for which a job can run. If your computing job will complete in under 30 hours, then you should choose one of our short queues. For longer jobs, we have long queues where the maximum running time is ten days. Full details of all available queues are shown below.
If you anticipate needing to run a computing job for longer than 10 days, please contact us to discuss.
It is important to choose the right queue because the scheduler will automatically delete a computing job if it attempts to exceed the time limit - even if the job has not finished. This can result in wasted time for you and wasted resources for other users — and you still won't have your results.
Cluster Queues and Nodes
Cluster Node Groups
The table below shows the groups of computers which form our cluster nodes.
Node Name Prefix | Num. of Nodes | Cores per Node | Total Cores | Total Node Memory (GB) | Memory per Slot (GB) | Queues | CPU Type | Chip Speed (GHz) |
---|---|---|---|---|---|---|---|---|
A | 48 | 48 | 2304 | 768 | 15 | short.qc, long.qc | Intel Skylake | 2.4 |
D | 32 | 24 | 768 | 384 | 15 | short.qc, long.qc | Intel Haswell | 2.5 |
E | 102 | 24 | 2448 | 384 | 15 | short.qc, short.qe, long.qc | Intel Skylake | 2.6 |
F | 16 | 40 | 640 | 192 | 4.8 | short.qf, long.qf | Intel Skylake | 2.4 |
H | 3 | 48 | 144 | 2000 | 41.25 | himem.qh | Intel Ivybridge | 3.0 |
How to read the table: The first row of the table shows that our first group of cluster nodes, designated by the letter A, comprises 48 computing nodes (i.e. 48 servers), named compa000 through to compa047. Each of these nodes contains 48 CPU cores (Intel Skylake architecture at a clock speed of 2.4GHz) and a total of 768GB of RAM. These A nodes are used by two queues, short.qc and long.qc.
The computers we use as cluster nodes are very powerful, typically much more powerful than is needed for a typical computing job. Consequently, each individual node is divided up into a number of virtual computing units, each comprising 1 virtual CPU and some RAM. A virtual CPU and its RAM allocation are collectively called a slot. The final piece of information in the table is that the Memory per Slot for A group nodes is 15GB. By default, an individual computing job will run with one slot, so on our A group nodes it will run with one virtual CPU and 15GB of RAM. When a computing job needs a greater amount of RAM and/or would benefit from using more than one virtual CPU (e.g. in parallel computing), it is possible to request more than one slot from the scheduler. Requesting two slots, for example, would mean that your job will run with 2 virtual CPUs and 2 x 15 = 30GB RAM. Of course, you should request extra slots only when they are needed or when testing whether they would be of benefit.
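As a hedged example (myscript.sh is an illustrative script name; the -pe shmem syntax for requesting slots is covered in the qsub parameters table later in this guide), requesting two slots on the short queue looks like this:
$ qsub -P project.prjc -q short.qc -pe shmem 2 myscript.sh
On A, D or E nodes this would give the job 2 virtual CPUs and 2 x 15 = 30GB of RAM.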
Cluster Queues
The full list of cluster queues and maximum job durations is shown below.
Name | Maximum Duration |
---|---|
test.qc | 1 minute (i.e. for testing purposes only). |
short.qc, short.qe, short.qf | 30 hours |
long.qc | 10 days |
long.qf | 14 days |
himem.qh | 14 days |
gpu8.q | 60 hours |
Interactive and Non-Interactive Cluster Use: qlogin and qsub
There are two different but complementary ways of getting your code running on the cluster.
qlogin
Using the qlogin command you can request an interactive cluster session, in which you are given a live terminal (aka shell) on one of the cluster nodes with the number of slots (CPU and RAM) you have requested. An interactive cluster session allows you to type commands and run code live from the terminal, just as you would from cluster1-2, with the benefit to you of dedicated resources and the benefit to others that your code is not running on the login nodes.
Since you are given an interactive BASH shell, your shell environment will automatically execute both your ~/.bash_profile and ~/.bashrc files to perform its setup. This is different to qsub scripts where no startup files are processed (neither ~/.bashrc nor ~/.bash_profile).
To request a simple qlogin session, simply run:
qlogin -q short.qc
The output you see on screen will look as follows:
[username@rescomp1] $ qlogin -q short.qc
Your job 1234567 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1234567 has been successfully scheduled.
Establishing /mgmt/uge/8.6.8/util/qlogin_wrapper.sh session to host compa002.hpc.in.bmrc.ox.ac.uk ...
Last login: Tue Jan 7 13:45:55 2020 from rescomp1
[username@compa002 ~] $
In the above output, you can read that your qlogin session has been successfully scheduled and you have been given an interactive cluster session on compa002. Notice how the prompt changes to [username@compa002] to show that you are now working on one of the cluster nodes rather than on rescomp1-2.
When you have finished with your interactive session, logout as normal by typing e.g. exit and you will return to rescomp1-2. If you wish to take a break from your interactive session then we recommend starting your interactive job from within a screen or tmux session and allowing screen or tmux to keep your interactive session alive.
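For example, a minimal sketch of this pattern (assuming screen is available on the login nodes; tmux works similarly):
$ screen -S mysession      # start a named screen session on the login node
$ qlogin -q short.qc       # request the interactive session from inside screen
# ... work interactively; press Ctrl-a d to detach and keep the session alive ...
$ screen -r mysession      # after logging back in later, re-attach to the same session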
The qlogin command accepts many of the same parameters and behaves similarly to the qsub command described in detail below. In particular, please note that the queue constraints still hold i.e. a session started with qlogin -q short.qc is still running on short.qc and subject to the same maximum duration of 30 hours.
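For example (a hedged sketch reusing parameters introduced for qsub elsewhere in this guide), an interactive session with two slots charged to your project could be requested with:
$ qlogin -q short.qc -P project.prjc -pe shmem 2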
qsub
Use of qsub to submit scheduled i.e. non-interactive jobs is described in detail below.
Submitting Jobs using qsub - Step by Step Guide for New Users
Prerequisites:
- You will need a cluster account
- You need to know how to login.
- You will need to know which projects you have access to (please ask your PI/group administrator).
First, log on to cluster1.bmrc.ox.ac.uk following our access guide and ensure that you are in your home directory by running:
$ cd ~
In practice, the easiest and most flexible method to submit a cluster job is to prepare and then submit a bash script file. This bash script file can contain configuration settings for the scheduler, can load pre-installed software, can set any environment variables that your software needs, and then can launch your main software. A bash script is also useful when doing repetitive work such as running the same analysis several times with different input files (see array jobs below).
We begin with a simple example. Once logged into cluster1.bmrc.ox.ac.uk, run the command shown below. Note that you should replace project.prjc with the name of your own project:
$ qsub -P project.prjc -q short.qc $SGE_ROOT/examples/jobs/simple.sh
Your job 4005 ("simple.sh") has been submitted.
This will submit the example script (simple.sh) located at $SGE_ROOT/examples/jobs/simple.sh. (NB the dollar symbol in $SGE_ROOT indicates that this is a shell variable. You can see the full path to simple.sh by running e.g. echo $SGE_ROOT/examples/jobs/simple.sh.) When you submit this job, the scheduler will report back that it has received your job and assigned it an ID number - see below for why this number is very important.
Running even such a simple task as this on the cluster involves a number of important steps that it will be helpful to understand. These are discussed below.
Job ID
Note first that every job submitted to the scheduler receives a JOB ID number. In the example above, the job id number was 4005 but your job id number will be different (indeed, because the BMRC cluster handles so many jobs, it is not unusual to have a job id number in the millions).
The ID number of your computing job is the most important piece of information about your job. When you need to contact the BMRC team for help, please remember to tell us what your job id was. This will make it considerably easier for us to help you.
Log Files
When you run a cluster job, the scheduler automatically produces two files, which it names in the format <jobname>.o.<jobid> and <jobname>.e.<jobid>. The first is the output file and the second is the error file.
If your job (“simple.sh”) has already finished, you will be able to see these files in your current directory by running:
$ ls
simple.sh.o.4005 simple.sh.e.4005
NB If running the ls command does not show these files, wait a few more seconds and try again.
To see the contents of a log file, you can use the cat command. (For longer log files, the less command is probably more useful.) Looking at the .o file should show the output produced by simple.sh.
qsub parameters and environment
As we have seen above, the scheduler software can be used very simply. However, it is also highly adaptable by passing certain configuration parameters. We will look at some of those parameters in more detail below. For now, we will focus on one important parameter - the environment.
When working in the shell in Linux, your commands are run in a particular context called the environment, which contains certain variables and their values. Your environment contains, for example, the PWD variable, which stores which directory you are currently working in. You can see this by running echo $PWD. When you run a command directly at the shell, your command inherits your current environment; however, submitting jobs to the scheduler is different. These jobs will not run in your current shell and, indeed, they will not even run on the same computer you are currently logged into. What value, then, should the scheduler use for your current working directory (i.e. the PWD variable)?
To solve this problem, the scheduler applies a simple policy: unless you explicitly state what you want the working directory to be, the scheduler assumes you want it to be your home directory. That is why, when you submitted your first job ("simple.sh") above, your log files (.o and .e files) appeared there. Note that the log files were not written to your home directory because you happened to be working there; they were written there because you did not specify a different working directory, in which case the scheduler defaults to your home directory.
Changing the behaviour of the qsub command can be achieved in three different ways:
1. Extra parameters can be added directly on the command line. In the example above, you already used this method to specify which project your job should be associated with by adding -P project.prjc. Similarly, to tell the scheduler to use the current working directory (where you are now) as the working directory when your script is run, you can add the -cwd parameter as follows:
$ qsub -P project.prjc -cwd $SGE_ROOT/examples/jobs/simple.sh
Using -cwd on its own, as we do here, means "use the current working directory at the time of submitting to qsub as the current working directory for the script". It is also possible to specify a fixed working directory using e.g. -wd /path/to/my/files (note that the parameter name here is -wd, for working directory).
2. Extra parameters can be added to your script files using the special script syntax #$. For example, instead of adding -P or -cwd to the command line you could add them to your bash script file like this:
...
#$ -P project.prjc
#$ -cwd
...
3. You can also set parameters in a configuration file at $HOME/.sge_request. This is probably best used for configuration settings that you won't change very often. For example, if you create the file $HOME/.sge_request and add these lines (similar to option 2)...
#$ -P project.prjc
#$ -cwd
...then these settings will be applied to every job you submit, unless you override them via methods 1 or 2. (Parameters in the script file override parameters on the command line, which in turn override parameters in your local configuration file.)
IMPORTANT - Avoiding inheriting environments with the -V parameter
Grid Engine permits the usage of the -V parameter which instructs the scheduler to make a copy of the user's shell environment, including all environment variables, at the time of submission and to add these to the environment when your scheduled job is run on a cluster node. We strongly recommend NOT using the -V parameter. Instead, we recommend putting all the code needed to setup your desired environment into your bash submission script.
The -V parameter interferes with the automatic selection of software and modules, as discussed in our module guide.
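As a hedged sketch of the recommended alternative (the environment variable and script name are illustrative; the module shown is the Python module used later in this guide), put the environment setup inside the submission script itself:
#!/bin/bash
#$ -P project.prjc
#$ -q short.qc
#$ -cwd
# Set up the environment inside the job rather than inheriting it with -V
module load Python/3.7.2-GCCcore-8.2.0
export MY_SETTING=value        # illustrative: any environment variables your software needs
python my_analysis.py          # illustrative: your actual work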
Writing your own bash script file
Now that we have practised with the example file at $SGE_ROOT/examples/jobs/simple.sh, let’s try writing our own bash script file.
First of all, make a subdirectory in your home folder called qsub-test and change directory into it:
$ mkdir qsub-test
$ cd qsub-test
Now copy the contents below into a file named python-hello.sh in your qsub-test folder. Make sure to change the second line, #$ -wd /path/to/qsub-test, to reflect the path to your qsub-test folder.
#!/bin/bash
#$ -wd /path/to/qsub-test
#$ -P project.prjc
#$ -N my-python-job
#$ -q short.qc
echo "------------------------------------------------"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"
module load Python/3.7.2-GCCcore-8.2.0
sleep 60s
python -c 'print("Hello")'
This script works as follows:
- The first line #!/bin/bash is a special command which says that this script should use the bash shell. You will normally want to include this in every bash script file you write.
- The next four lines use the special #$ comment syntax to add some parameters to our qsub command. In particular:
  - We specify our project name using -P (make sure to substitute your own project name).
  - We specify -wd /path/to/qsub-test so that the script will be run with the qsub-test folder as its current working directory.
  - We use -N to specify a name for this job. By default, jobs are named after their script files, but you can specify any name that would be helpful to you.
  - Finally, we use -q short.qc to specify that the job should run on that particular queue.
- The next lines (all beginning echo ...) print some useful debugging information. This is included here just for information.
- The line module load Python/3.7.2-GCCcore-8.2.0 uses the module system to load a version of Python.
- The line sleep 60s causes our script to pause for 60 seconds.
- The final line uses Python to print the message ‘Hello’.
Now, for the sake of this example, we will move back to our home directory and submit the script file from there:
$ cd $HOME
$ qsub qsub-test/python-hello.sh
Your job 4006 ("my-python-job") has been submitted.
Now that your job has been submitted, if you are quick enough (within 60 seconds!) you can use the qstat command to check the status of your job. The output will look something like this:
$ qstat
job-ID prior name user state submit/start at queue jclass slots ja-job-ID
------------------------------------------------------------------------------------------------------------
4006 0.000 my-python-job <username> qw 01/09/2020 21:14:18 1
Because we told the scheduler to use the folder $HOME/qsub-test as the working directory for this script, that is where it will put our log files (even though we submitted the script from our $HOME directory). Now let’s look at the log files:
$ cd qsub-test
$ ls
my-python-job.o.4006 my-python-job.e.4006
$ cat my-python-job.o.4006
------------------------------------------------
Run on host: compf003.hpc.in.bmrc.ox.ac.uk
Operating system: Linux
Username: <your-username>
Started at: Thu Jan 9 16:37:58 GMT 2020
------------------------------------------------
Hello
The output shows that this script ran on the node compf003.hpc.in.bmrc.ox.ac.uk; your own job will likely run on a different node. It also shows that the cluster node’s operating system was Linux, that the job was run as your username, and that the job started at the time shown.
Since this job ran successfully, the .e file will be empty, which is a reassuring sign! You can check that it’s empty, if you like, using the cat command.
Submitting to Specific Host Groups or Nodes
Sometimes, you may wish to target more precisely which cluster nodes your job will run on than by specifying just a queue. In this case, it is possible to add further information to your queue specification.
- Specifying e.g. qsub -q short.qc@@short.hga will run on the short queue AND only on A nodes, while qsub -q short.qc@@short.hge will run on the short queue AND only on the E nodes. The normal list of “hostgroups” that you can use for these purposes are:
- short.hga, short.hgd, short.hge, short.hgf
- long.hga, long.hgd, long.hge, long.hgf
It is rarely desirable to limit your job to running on a particular node, but where it is necessary you can use the ‘@@’ specification in the same way to target a specific node.
- Specifying e.g. qsub -q short.qc@@compd024 will run in short.qc AND only on node compd024.
Command Line Options and Script Parameters
The qsub command can be used to run jobs very simply. However, it also accepts a wide range of configuration settings which we can add to the qsub command. For example, the option -q specifies which queue to submit to, so we could resubmit sum.sh to a different queue as follows:
$ qsub -P group.prjc -q short.qe sum.sh
Typing all the desired options for a job at the command line is certainly possible, but it has disadvantages: not only does it require more typing, it also makes it harder to record which options were used on a particular occasion. So, instead of having to type these options in every time, it is also possible to include them directly in our script file.
Using your favourite editor, let’s update the sum.sh file as follows:
#!/bin/bash
#$ -P group.prjc
#$ -q short.qc
echo "------------------------------------------------"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"
echo "1+1=$((1+1))"
The two additional lines use a special comment syntax: a line begins with #$ followed by a space, followed by the same parameter that you would use at the command line. Now that we have specified our project and queue inside our script, we can run this job more simply than before like this:
$ qsub sum.sh
Instead of writing your own job submission script from scratch, you can also use a copy of our job submission template file, located at /apps/scripts/sge.template.sh, which contains hints on a number of commonly used parameters. To use this file, begin by making a copy in your own home directory by running:
cp /apps/scripts/sge.template.sh ~/myjobscript.sge.sh
Then you can edit the file for your own purposes.
Checking or Deleting Running Jobs
A busy computing cluster handles many thousands of computing jobs every day. For this reason, when you submit a computing job to the cluster your job may not run immediately. Instead, it will be held in its queue until the scheduler decides to send it to a cluster node for execution. This is one important difference to what happens when you run a job on your own computer.
You can check the status of your job in the queue at any time using the qstat command. Running qstat on its own will report the status of every one of your jobs which remains in the queue. Alternatively, use qstat -j <jobid> to report the status of an individual job. The output of these commands will look like this:
job-ID prior name user state submit/start at queue jclass slots ja-job-ID
-----------------------------------------------------------------------------------------------------------
22355866 0.000 sum.sh <username> qw 01/09/2020 21:14:18 1
qstat reports lots of useful information such as the job id, submission time (or, if the job has started, its start time), and the number of reserved slots. However, perhaps the most important piece of information is in the state column. For example, qw means that the job is held in the queue and waiting to start, while a state of r means that the job is currently running.
One important thing to note is that the qstat command reports only jobs which are either waiting to run or currently running. Once a job has finished, qstat will no longer report it. Instead, once a job has finished you can see a detailed audit, including total run time and memory usage, by running qacct -j <jobid>.
Occasionally, you may wish to delete a job before it starts or before it completes. For this you can use the qdel <jobid> command.
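For example, to delete the job shown in the qstat output above:
$ qdel 22355866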
Preparing a bash script file
We’ll start by using the bash script template located at /apps/scripts/sge.template.sh. You should take a copy of this file to use as a starting point for your own scripts. The file will begin with the following content:
#!/bin/bash
# Specify a job name
#$ -N jobname
# --- Parameters for the Queue Master ---
# Project name and target queue
#$ -P group.prjc
#$ -q test.qc
# Run the job in the current working directory
#$ -cwd -j y
# Log locations which are relative to the current
# working directory of the submission
###$ -o output.log
###$ -e error.log
# Parallel environment settings
# For more information on these please see the wiki
# Allowed settings:
# shmem
# mpi
# node_mpi
# ramdisk
#$ -pe shmem 1
# Print some useful data about the job to help with debugging
echo "------------------------------------------------"
echo "SGE Job ID: $JOB_ID"
echo "SGE Job ID: $SGE_JOB_ID"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"
# Finally, we can run our real computing job
echo $JOB_ID
# End of job script
The first line #!/bin/bash specifies that this is a bash script (this will always be the same).
Any line beginning with # is a bash comment, i.e. it will be ignored by the bash interpreter when the script file is actually run. Comments are most often there to make the code more human readable, i.e. to explain what the code does to any human readers. However, our template file also contains a number of special comments which begin #$. These special comments provide configuration settings and will be read and understood by the scheduler.
For example, it is often helpful to specify a name for your job - you can do this by changing the line #$ -N jobname. Some of the configuration settings in the template file are optional but others are required. For example, submitting a job to the cluster requires that you specify a project to associate the job with. This is used for accounting and also defines which of your affiliations the work is attributed to. Each group that can use the cluster will be given a project, often associated with the PI of the group.
The penultimate section of the bash script template uses the echo command to output some information which can be useful in debugging.
Finally, the script runs the actual computing job. In this case, the job is a trivial one of printing out the job id, which is achieved by the line echo $JOB_ID. However, it is possible to achieve very complex things in your bash script file, including loading specific software modules, looping over a set of data or running a whole pipeline in sequence.
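As a minimal sketch of looping over a set of data (the directory layout, file names and analysis script here are illustrative assumptions), a job script might look like this:
#!/bin/bash
#$ -N loop-example
#$ -P group.prjc
#$ -q short.qc
#$ -cwd
module load Python/3.7.2-GCCcore-8.2.0    # load whatever software your analysis needs
mkdir -p results                           # illustrative output directory
# Process each input file in turn within a single job
for infile in data/*.csv; do
    echo "Processing ${infile}"
    python analyse.py "${infile}" > "results/$(basename "${infile}" .csv).out"
done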
Commonly Used qsub Parameters
Run man qsub to see the full list of qsub options. However, here are some commonly used options as a guide:
Parameter | Purpose |
---|---|
-P <project>.prjc | Specify which project your job belongs to (required) |
-q <queuename> | Specify which queue to run a job in. See the full list of cluster queues |
-j y | Instead of separate output and error log files, send the error output to the regular output log file |
-o <dir>, -e <dir> | Put the output log file (-o) and/or error log file (-e) into the specified directories, keeping their default file names. By default, they go into the current working directory. |
-o <path/to/file>, -e <path/to/file> | Put the output log file (-o) and/or error log file (-e) into the specified files. |
-cwd | Use the current working directory at the time of job submission (i.e. your location when you run qsub) as the current working directory when executing your script |
-wd </path/to/dir> | Use /path/to/dir as the working directory when executing your script |
-pe shmem N | Request N slots. See the cluster queues reference for full details of our nodes. NB The higher the number of slots requested, the longer it will normally take for your job to be scheduled. A practical maximum of 12 slots is possible on short and long queues. The theoretical maximum is 48 slots on A nodes, 24 slots on D and E nodes, 12 slots on F nodes and 46 slots on H nodes, but you are advised to request less than the theoretical maximum so that your jobs are scheduled within a reasonable time. |
-pe ramdisk <config> | Use a ramdisk - see the ramdisk guide. |
-b y <executable> | Indicate that your job is not a shell script but a binary that should be run directly |
-S </path/to/shell> | By default, our cluster uses the Bash shell. Use -S if you need your job submission script to be interpreted by another shell. |
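Putting several of these options together (a hedged example; myscript.sh and the logs directory are illustrative):
$ mkdir -p logs
$ qsub -P group.prjc -q short.qc -N my-analysis -cwd -j y -o logs/ -pe shmem 2 myscript.sh
This submits myscript.sh to short.qc under the project group.prjc, names the job my-analysis, runs it in the current working directory, requests two slots, and writes a single combined log file into the logs directory.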
Array Jobs
If you have a set of jobs that perform the same action on many different inputs then you may be better off submitting an SGE array job. An array job:
- is made up of a number of tasks, with each task having a specific task id
- has a single submission script that gets executed for each task in the array
- has a single job id
The tasks get scheduled separately such that many tasks can be executing at the same time. This means tasks should not depend upon each other as there is no guarantee any one task will have completed before another starts.
The task id, along with other relevant information, is passed to the job through some environment variables. The task id is a number. It may be possible to use this number directly in your script or it may be necessary to use the task id to identify the specific inputs for the task.
There are a number of benefits to choosing to use array jobs rather than submitting lots of nearly identical jobs. As the entire array is just a single job, it is easier for you to keep track of and manage your jobs. Also, array jobs put much less load on the job scheduling system than the equivalent number of single jobs.
Declaring an array job
Array jobs are declared using the ‘-t’ argument to qsub:
-t n[-m[:s]]
- n is the first task id;
- m is the last task id;
- s is the task id step size (default is 1);
- If only n is specified, it is treated as ‘-t 1-n:1’.
For example, the task id range specified by 2-10:2 would result in the task id indexes 2, 4, 6, 8, and 10, for a total of 5 otherwise identical tasks.
The following restrictions apply to the values n and m:
1 <= n <= MIN(2^31-1, max_aj_tasks)
1 <= m <= MIN(2^31-1, max_aj_tasks)
n <= m
where max_aj_tasks is defined in the cluster configuration (see qconf -sconf; 75000 at the time of writing). If max_aj_tasks is insufficient then please contact bmrc-help@medsci.ox.ac.uk to ask for it to be raised.
The number of tasks eligible for concurrent execution can be controlled by the ‘-tc’ argument to qsub:
-tc max_running_tasks
max_running_tasks is the number of tasks in this array job that are allowed to be executing at the same time. For example, -t 1-100 -tc 2 would submit a 100-task array job where only 2 tasks can be executing at the same time. Note that this is the maximum allowed to run at the same time; it does not mean that tasks will be run in pairs.
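For example (a sketch reusing the project and queue names from earlier examples; myarray.sh is illustrative):
$ qsub -P project.prjc -q short.qc -t 1-100 -tc 10 myarray.sh
This submits a 100-task array job of which at most 10 tasks run concurrently.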
Environment variables
SGE sets the following task-related environment variables:
- SGE_TASK_ID - the task id of this task
- SGE_TASK_FIRST - the task id of the first task in the array (parameter ‘n’ given to the ‘-t’ argument)
- SGE_TASK_LAST - the task id of the last task in the array (parameter ‘m’ given to the ‘-t’ argument)
- SGE_TASK_STEPSIZE - the size of the step in task id from one task to the next (parameter ‘s’ given to the ‘-t’ argument)
Step size and batching
There is a significant dead time between one job or task finishing and the next starting - sometimes as much as 15s. This time is spent reporting the job/task’s exit status and resource usage, tidying up after the job/task, waiting for the next run of the scheduler to select the next job/task and preparing for that job/task to run. If the action performed by the script takes more than a couple of hours to run, this dead time is fairly insignificant. If the action takes one second to run then running it once per job/task means that the dead time is 15 times longer than the run time and it will take very much longer than expected to finish the job. The best approach for such short job/task run times is to batch multiple runs together.
This can be done by using an array job with a step size > 1 and a loop within the script that loops over all the task ids in the step. The step size should be chosen such that the combined run time for all the tasks in the step is long enough to make the dead time insignificant without putting the task at risk of being killed for exceeding the queue’s run time limit. It is also polite to not hog the execution slot, so 2-4 hours is a good target. With a little bit of arithmetic it is possible to write the script in such a way that all that needs adjusting is the step size and the script takes care of the rest. See ‘Example using step size’ below for an example script.
Simple example
#!/bin/bash
# This script sets up a task array with a step size of one.
#$ -N TestSimpleArrayJob
#$ -q test.qc
#$ -t 1-527:1
#$ -r y
echo `date`: Executing task ${SGE_TASK_ID} of job ${JOB_ID} on `hostname` as user ${USER}
echo SGE_TASK_FIRST=${SGE_TASK_FIRST}, SGE_TASK_LAST=${SGE_TASK_LAST}, SGE_TASK_STEPSIZE=${SGE_TASK_STEPSIZE}
##########################################################################################
#
# Do any one-off set up here
#
## For example, set up the environment for R/3.1.3-openblas-0.2.14-omp-gcc4.7.2:
## . /etc/profile.d/modules.sh
## module load R/3.1.3-openblas-0.2.14-omp-gcc4.7.2
## which Rscript
#
##########################################################################################
##########################################################################################
#
# Do your per-task processing here
#
## For example, run an R script that uses the task id directly:
## Rscript /path/to/my/rscript.R ${SGE_TASK_ID}
## rv=$?
#
##########################################################################################
echo `date`: task complete
exit $rv
Example using step size
#!/bin/bash
# This script sets up a task array with one task per operation and uses the step size
# to control how many operations are performed per script run, e.g. to manage the
# turnover time of the tasks. This also makes it a bit easier to re-run a specific
# task than using a step size of one and an unrelated loop counter inside the script
#$ -N TestArrayJobWithStep
#$ -q test.qc
#$ -t 1-527:60
#$ -r y
echo `date`: Executing task ${SGE_TASK_ID} of job ${JOB_ID} on `hostname` as user ${USER}
echo SGE_TASK_FIRST=${SGE_TASK_FIRST}, SGE_TASK_LAST=${SGE_TASK_LAST}, SGE_TASK_STEPSIZE=${SGE_TASK_STEPSIZE}
##########################################################################################
#
# Do any one-off set up here
#
## For example, set up the environment for R/3.1.3-openblas-0.2.14-omp-gcc4.7.2:
## . /etc/profile.d/modules.sh
## module load R/3.1.3-openblas-0.2.14-omp-gcc4.7.2
## which Rscript
#
##########################################################################################
# Calculate the last task id for this step
this_step_last=$(( SGE_TASK_ID + SGE_TASK_STEPSIZE - 1 ))
if [ "${SGE_TASK_LAST}" -lt "${this_step_last}" ]
then
this_step_last="${SGE_TASK_LAST}"
fi
# Loop over task ids in this step
while [ "${SGE_TASK_ID}" -le "${this_step_last}" ]
do
echo `date`: starting work on SGE_TASK_ID=`printenv SGE_TASK_ID`
##########################################################################################
#
# Do your per-task processing here
#
## For example, run an R script that uses the task id directly:
## Rscript /path/to/my/rscript.R ${SGE_TASK_ID}
## rv=$?
#
##########################################################################################
# Increment SGE_TASK_ID
export SGE_TASK_ID=$(( SGE_TASK_ID + 1 ))
done
echo `date`: task complete
exit $rv
Mapping task-specific arguments from task ids
The examples above just use the task id directly but this isn’t always possible. There are a number of ways that one might be able to map the task id to the task-specific arguments, but if your requirements are complicated then you may have to write some code to do it. However, there are some simple options.
If you have a file that lists the task-specific arguments, with the arguments for the nth task on the nth line:
cat /path/to/my/argument/list | tail -n+${SGE_TASK_ID} | head -1 | xargs /path/to/my/command
Examples where this might be useful include processing a list of genes or SNPs.
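An equivalent sketch using sed instead of tail/head (the argument list file and command paths are the illustrative ones from the example above):
# Pull out the line matching this task's id and pass its contents as arguments
ARGS=$(sed -n "${SGE_TASK_ID}p" /path/to/my/argument/list)
/path/to/my/command ${ARGS}    # ${ARGS} is deliberately unquoted so each word becomes a separate argument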
Ramdisks
Some applications depend on a high-speed local file system, and on many of the nodes in the cluster the default ${TMPDIR} is not very performant. To alleviate this, you can allocate some of the memory from your job as a disk capable of handling temporary files. This has limitations: you need a clear idea of the upper limits required to satisfy both the memory for the job and the temp space required.
Tutorial
Instead of using the ‘shmem’ parallel environment to specify your slots, use ‘ramdisk’ to specify the slots required:
-pe ramdisk <number_of_slots>
You will also need to set a hard vmem limit. This sets the limit on how much memory will be available to the application. It must not exceed the total memory available for the given slots on your selected node:
-l h_vmem=<memory_limit>
A ramdisk will be created at ${TMPDIR}/ramdisk, with a size equal to the total memory available from the selected slots minus the hard memory limit assigned to the job. The value of ${TMPDIR} is the one given to your script by SGE. If you change ${TMPDIR} in your script, make sure your script makes a note of where the ramdisk is mounted before the change is made.
E.g. for a job that needed 20GB memory and some ramdisk, the arguments:
-P group.prjc -q short.qc -l h_vmem=20G -pe ramdisk 2 ./jobscript.sh
Given the default in that queue of 16G per slot, these flags would limit your job to 20G of memory but also create a 12G ramdisk using the remaining available memory provided by 2 slots.
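Inside the job script, the ramdisk can then be used like any other scratch directory. A minimal sketch (the data file and program names are illustrative assumptions):
#!/bin/bash
#$ -P group.prjc
#$ -q short.qc
#$ -l h_vmem=20G
#$ -pe ramdisk 2
# Record the ramdisk location before anything changes TMPDIR
RAMDISK="${TMPDIR}/ramdisk"
echo "Ramdisk mounted at ${RAMDISK}"
cp /path/to/large/input.dat "${RAMDISK}/"                  # illustrative: stage data onto the ramdisk
my_program --workdir "${RAMDISK}" "${RAMDISK}/input.dat"   # illustrative program and option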
Job Dependencies
Once your cluster workflows reach a certain level of complexity, it becomes natural to want to order them into stages and to require that later jobs commence only after earlier jobs have completed. The scheduler includes dedicated parameters to assist with creating pipelines to meet this requirement.
To ensure that job B will commence only after job A has finished, you first submit job A and make a note of either its job id number or its name (this is especially useful if you have assigned a name using -N). Now you can submit job B and specify that it should be held until job A finishes:
qsub -q short.qc -hold_jid <jobA_id_or_name> jobB.sh
If wishing to make such hold requests in a shell script, it may help to submit job A using the -terse parameter so that only its job id number is returned. You can then collect this id number in a shell variable and use it when submitting job B like so:
JOBA_ID=$(qsub -terse -q short.qc jobA.sh)
qsub -q short.qc -hold_jid $JOBA_ID jobB.sh
It is also possible to use an advanced form of job holding when you have sequential array jobs and you wish the tasks in the second job to start as soon as the corresponding task in the first job has completed. For example, if jobA is an array job with tasks numbered 1-100 and jobB is another array job also with tasks numbered 1-100, then one can arrange for jobB's tasks to run as soon as the corresponding task in jobA has completed by running:
qsub -q short.qc -t 1-100 -N jobA jobA.sh
qsub -q short.qc -hold_jid_ad jobA jobB.sh
Using this method, which demonstrates holding a job by name rather than by job id, jobB's tasks will be allowed to run (subject to availability on the cluster) as soon as the corresponding task in jobA has completed. Note that since jobA's tasks may have different run-time durations and so may complete in any order, jobB's tasks may also commence in any order.
Troubleshooting
Cluster jobs can go wrong for many different reasons. If one of your computing jobs hasn’t behaved as expected, here are some suggestions on how to diagnose the issue. To begin, we focus on jobs which have already run to completion.
Check the Job Report using qacct
Login to the cluster and run qacct -j <job-id>. This will print a summary report of how the job ran, which looks as follows:
# qacct -j 12345678
==============================================================
qname short.q
hostname <hostname>
group users
owner <user>
project NONE
department defaultdepartment
jobname <jobname>
jobnumber 12345678
taskid undefined
account sge
priority 0
qsub_time Fri Jan 24 19:58:35 2020
start_time Fri Jan 24 19:58:42 2020
end_time Fri Jan 24 19:59:43 2020
granted_pe NONE
slots 1
failed 0
exit_status 0
ru_wallclock 61
ru_utime 0.070
ru_stime 0.050
ru_maxrss 1220
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 2916
ru_majflt 0
ru_nswap 0
ru_inblock 0
ru_oublock 176
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 91
ru_nivcsw 8
cpu 0.120
mem 0.001
io 0.000
iow 0.000
maxvmem 23.508M
arid undefined
Now run through the following checks:
- Check the Failed status. If Failed is 0 then your job ran successfully to completion (from a code execution point of view). However, if Failed is 1 then check the Exit Status.
- The Exit Status reported by qacct -j will be a number from 0 to 255. If the Exit Status is greater than 128 (i.e. Exit Status is 128+n) then the computing job was killed by a specific shell signal whose id number is n. You can find which signal corresponds to n by running man 7 signal, e.g. on cluster1.bmrc.ox.ac.uk. For example, Exit Status 137 is 128+9, where 9 is the signal for SIGKILL, which means that the job was automatically terminated, normally because it exceeded either its maximum duration or its maximum RAM.
- Check maxvmem. This figure should not exceed the RAM limit of the queue (see table).
If qacct -j reports that your job ran to completion (i.e. if Failed is 0) then the next step is to examine the .o and .e log files.
Troubleshooting - Why did my job fail?
Every job submitted via qsub receives a job id number - this is reported back to you when submitting the job.
The job id number is the most important piece of information in diagnosing a job. If you are contacting the BMRC team for help, please make sure to include your job id number.
If a computing job has failed (i.e. has ceased to run), try these steps in order:
- Check your log files, in particular the error log files. By default these files are named by combining your job script name, your job id and the letter 'o' for regular output or the letter 'e' for error output. For example, the script analyse.sh submitted as job id 2001524 would produce log files named analyse.o2001524 and analyse.e2001524. Checking the error log file will show if there are error messages that might help explain why your job failed. Some common problems are these:
Message in error log file | Explanation & Resolution |
---|---|
illegal instruction... | You are running software built for one CPU architecture on another, incompatible CPU type. The most common cause of this is the inclusion of the -V parameter to qsub, either on the command line or in your scripts. We strongly recommend against using the -V parameter because it interferes with the normal process whereby our system automatically chooses the correct software for each system's CPU architecture. Instead, please include all necessary environment setup in your job submission script. |
module command not found | Please check that the first line of your script file is #!/bin/bash (with no empty lines preceding it). If you find that a script works when run interactively but produces a "module not found" error when submitted, this is likely the issue. |
- If the log files have not helped, you can view the accounting information for this job by running qacct -j <job_id>. For example, with the example job mentioned above you would run qacct -j 2001524.
The accounting file shows lots of information. To begin with, focus on the values shown for: exit_status and failed. Some common problems are shown in the table:
Exit Status / Failed message | Explanation & Resolution |
---|---|
Failed: 52 : cgroups enforced memory limit OR Exit Status: 137, Failed: 100 : assumedly after job | It's likely that your job exceeded the maximum RAM allocation. Check the maxvmem value reported by qacct against the queue specification (in general, values over 15.2GB are exceeding the maximum). You can either reduce the memory requirements of your script or increase the amount of RAM available to your job by selecting extra slots (see the section on parallel environments in the cluster guide). |
Exit Status: 137, Failed: 44 : execd enforced h_rt limit | Your job exceeded the maximum duration for its queue (see the queue time limits shown above). If you were using a short queue you could re-submit to the long queue to get extra time. Alternatively, you will need to try to shorten your run time so that your code completes in the time available - please contact us for advice. |
Exit Status: 0, Failed: 27 : searching requested shell | Your job submission script could not be read because it was prepared on Microsoft Windows and its text format is incompatible. Please run dos2unix <script.sh> on your script file to convert it to Linux format and then resubmit. |
Working with Anonymous User Accounts and Group Reports
All usernames on BMRC systems are created in an anonymised format. To help users identify and discover real names, a show command is provided at /apps/misc/utils/bin/show. To save typing, we recommend adding an alias to your ~/.bashrc file by adding this line:
alias show='/apps/misc/utils/bin/show'
After adding this line and running source ~/.bashrc you will be able to run the show command directly without needing to type the full path.
On its own, or with a path, show will produce a directory listing with real names included
show
show /well/mygroup/users
To see a user's real name, use show -u
show -u abc123
To see a listing of all members in a particular group, use show -g
On its own, show -g will show members of your primary group
show -g
Alternatively, specify a group to show
show -g groupname
To find a particular person's username, you can search by real name using show -w
show -w sarah
NB The show command is restricted to showing groups to which you belong and users in groups to which you belong.