Cookies on this website
We use cookies to ensure that we give you the best experience on our website. If you click 'Continue' we'll assume that you are happy to receive all cookies and you won't see this message again. Click 'Find out more' for information on how to change your cookie settings.

Quick Links

Introduction to Cluster Computing

The purpose of a shared computing cluster is to allow a large number of users to do useful computing work in a way that is typically faster, easier, better organised, more reliable and more cost-effective than running it on personal desktops or laptops. In basic terms, a computing cluster comprises some computers and shared disk storage, plus some special software to coordinate all the different computing work that different users wish to run.

A little terminology:

  • Computers in a cluster are often called nodes.
  • A computing cluster is designed to carry out computing jobs. A job involves running a piece of code (whether simple or complex) which eventually terminates and usually produces an output. For example, analysing some data in Python or R is a job; likewise, running a machine learning task which eventually terminates is also a job. Conceptually, all of the useful computing work that you can do on a cluster is organised into jobs

The BMRC cluster comprises (similarly to other clusters):

  • Two computers called login nodes (also gateway nodes). These are the normal entrypoint into the cluster - to use the cluster you start by logging into one of the login nodes.
  • A number of worker nodes or cluster nodes, each of which is a separate computer which can run several computing jobs.
  • A fast and robust shared disk space that is accessible (with appropriate permissions and security) to both the login nodes and the cluster nodes.
  • A special piece of software called the scheduler which coordinates all the computing jobs that users wish to run.
  • A wide range of pre-installed scientific software plus the means to add additional custom software.

Typical Cluster Workflow

To understand the BMRC cluster, it will help to visualise it with a simplified diagram and to describe a typical user workflow.

cluster-schematic.png

A typical workflow when using the cluster is as follows. First, you use your own computer to login to a login node. Then you prepare and submit one or more computing jobs to the scheduler. The scheduler then takes care of choosing which cluster node will run your job. In your job, you can use pre-installed software or you can provide your own. You can also use data stored on disk as inputs to your computing job. Finally, the output files produced by your jobs, together with some log files containing information about where and how your job ran, are also stored in your disk space. Later, you re-login to a login node and view the outputs of your computing jobs on disk.

 As you have just seen, the BMRC cluster comprises some dedicated login nodes together with some dedicated cluster nodes. It is important to remember that the login nodes are reserved solely for the purpose of preparing and submitting computing jobs to the scheduler, which will then send them to a cluster node for execution. For this reason, please note that computing jobs MUST NOT be run directly on the login nodes themselves. Any significant jobs running on login nodes (i.e. without submitting them to the scheduler) are liable to be stopped by the BMRC team. This policy helps to ensure that the login nodes are always available to everyone for their intended use.

The BMRC Cluster in Numbers

Here a few headline statistics regarding the BMRC cluster (as of January 2020).

  • ~7000 CPU cores and ~50 GPUs. All computers run a version of the Linux operating system.
  • 7PB (7 million Gigabytes) of fast, shared storage serving data at up to 30GB/s
  • The scheduling software used by BMRC is Univa Grid Engine. It currently manages ~ 24 million computing jobs per year.

Overview of Submitting a Job to the Cluster

A shared computing cluster is used by many different users to run many different types of computing job. One of the functions of a computing cluster and the scheduler is to handle this variation. In particular:

  • The BMRC cluster is made of groups of different computers which vary by:
    • CPU architecture, number of CPUs and CPU cores
    • RAM type and quantity
    • whether they include GPU processing.
  • The computing jobs which users wish to run vary by:
    • how long they will take to run
    • how many CPU cores and much RAM they will use
    • whether they will require a GPU
    • whether an individual job can be divided into subjobs which can run in parallel

Consequently, it is the job of the scheduler to distribute many different kinds of job among many different types of cluster node in a way that is fair and makes efficient use of available compute resources. To assist in this task, the scheduler runs a number of different queues. Each queue provides different sets of resources from other queues and users are required, when submitting a job mto choose which queue is most suitable. The full details of the BMRC cluster queues are shown below.

 Choosing the best queue is sometimes a process of trial and error, until you have a better idea of the requirements of your computing job. In most cases, the single most significant variable between queues is the maximum time for which a job can run. If your computing job will complete in under 30 hours, then you should choose one of our short queues. For longer jobs, we have long queues where the maximum running time is ten days. Full details of all available queues are shown below.

If you anticipate needing to run a computing job for longer than 10 days, please contact us to discuss.

It is important to choose the right queue because the scheduler will automatically delete a computing job if it attempts to exceed the time limit - even if the job has not finished. This can result in wasted time for you and wasted resources for other users — and you still won't have your results.

Cluster Queues and Nodes

Cluster Node Groups

The table below shows the groups of computers which form our cluster nodes.

Node Name Prefix Num. of nodes Num. of cores Total Cores Total Node Memory (GB) Memory per Slot (GB) Queues CPU Type Chip Speed (GHz)
C​ 108​ 16​ 1,728​ 256​ 16​ short.qc,long.qc Intel Ivybridge​ 2.6​
D​ 32​ 24​ 768​ 384​ 16​ short.qc,long.qc Intel Haswell​ 2.5​
E​ 102​ 24​ 2,448​ 384​ 16​ short.qc,short.qe,​long.qc Intel Skylake​ 2.6​
F​ 16​ 40​ 640​ 192​ 4.8​ short.qf,long.qf Intel Skylake​ 2.4​
H​ 3​ 48​ 144​ 2000​ 42.666​ himem.qh Intel Ivybridge​ 3​

How to read the table: The first line of the table displays that in our first group of cluster nodes, designated by the letter C, we have 108 computing nodes (i.e. 108 computers). These are named compC001 through to compC108. Each of these nodes contains 16 CPU cores each with Intel Ivybridge architecture at clock speed 2.6GHz and a total of 256GB of RAM. These C nodes are used by two queues, short.qc and long.qc.

The computers we use as cluster nodes are very powerful, typically much more powerful than is needed for a typical computing job. Consequently, each individual node is divided up into a number of virtual computer units comprised of 1 virtual CPU and some RAM. A virtual computer and RAM allocation are collectively called a slot. The final piece of information in the table is that the Memory per Slot for C group nodes is 16GB. By default, a individual computing job will run with one slot and so on our C group nodes it will run with one virtual CPU and 16GB of RAM. When a computing job needs a greater amount of RAM and/or would benefit from using more than one virtual CPU (e.g. in parallel computing), it is possible request more than one slot from the scheduler. Requesting two slots, for example, would mean that your job will run with 2 virtual CPUs and 2 x 16 = 32GB RAM. Of course, you should request extra slots only when they are needed or when testing whether would be of benefit.

Cluster Queues

The full list of cluster queues and maximum job durations is shown below.

Name Maximum Duration
test.qc 1 minute (i.e. for testing purposes only).
short.qcshort.qeshort.qf 30 hours
long.qclong.qf 10 days
himem.qh 14 days
gpu8.q 60 hours
 Note that test.qc jobs are run on the same groups of cluster nodes as long.qc i.e. on CD and E nodes.

Interactive and Non-Interactive Cluster Use: qlogin  and qsub

There are two different but complementary ways of getting your code running on the cluster.

qlogin

Using the qlogin command you can request an interactive cluster session in which you are given a live terminal (aka shell) on one of the cluster nodes with the amount of slots (CPU and RAM) you have requested. An interactive cluster session allows you to type commands and run code live from the terminal, just as you would from rescomp1-2but with the benefit to yourself of giving you dedicated resources and with the benefit to others that your code is not running on the login nodes.

Since you are given an interactive BASH shell, your shell environment will automatically execute both your ~/.bash_profile and ~/.bashrc files to perform its setup. This is different to qsub scripts where only your ~/.bash_profile will be executed.

To request a simple qlogin session, simple run:

qlogin -q short.qc

The output you see on screen will look as follows:

[username@rescomp1] $ qlogin -q short.qc
Your job 1234567 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 1234567 has been successfully scheduled.
Establishing /mgmt/uge/8.6.8/util/qlogin_wrapper.sh session to host compc002.hpc.in.bmrc.ox.ac.uk ...
Last login: Tue Jan 7 13:45:55 2020 from rescomp1
[username@compc002 ~] $

In the above output, you can read that your qlogin session has been successfully scheduled and you have been given an interactive cluster session on cmpc002. Notice how the prompt changes to [username@compc002] to show that you are now working on one of the cluster nodes rather than on rescomp1-2.

When you have finished with your interactive session, logout as normal by typing e.g. exit and you will return to rescomp1-2. If you wish to take a break from your interactive session then we recommend starting your interactive job from within a screen or tmux session and allowing screen or tmux to keep your interactive session alive.

The qlogin command accepts many of the same parameters and behaves similarly to the qsub command described in detail below. In particular, please note that the queue constraints still hold i.e. a session started with qlogin -q short.qc is still running on short.qc and subject to the same maximum duration of 30 hours.

qsub

Use of qsub to submit scheduled i.e. non-interactive jobs is described in detail below.

Submitting Jobs using qsub - Step by Step Guide for New Users

Prequisites:

  • You will need a cluster account
  • You need to know how to login.
  • You will need to know which projects you have access to (please ask your PI/group administrator).

First logon to rescomp1.well.ox.ac.uk following our access guide and ensure that you are in your home directory by running:

$ cd ~

In practice, the easiest and most flexible method to submit a cluster job is to prepare and then submit a bash script file. This bash script file can contain configuration settings for the scheduler, can load pre-installed software, can set any environment variables that your software needs, and then can launch your main software. A bash script is also useful when doing repetitive work such as running the same analysis several times with different input files (see array jobs below).

We begin with a simple example. Once logged into the rescomp1.well.ox.ac.uk run the command shown below. Note that you should replace project.prjc with the name of your own project:

$ qsub -P project.prjc $SGE_ROOT/examples/jobs/simple.sh
You job 4005 ("simple.sh") has been submitted.

This will submit the example script (simple.sh) located at $SGE_ROOT/examples/jobs/simple.sh. (NB Note that the dollar symbol in $SGE_ROOT indicates that this is a shell variable. You can see the full path to simple.sh by running e.g. echo $SGE_ROOT/examples/jobs/simple.sh. When you submit this job, the scheduler will report back that it has received your job and assigned it an ID number - see below for why this number is very important.

To run even such as simple task as this on the cluster involves a number of important steps that it will be helpful to understand. These are now discussed below.

Job ID

Note first that every job submitted to the scheduler receives a JOB ID number. In the example above, the job id number was 4005 but your job id number will be different (indeed, because the BMRC cluster handles so many jobs, it is not unusual to have a job id number in the millions).

The ID number of your computing job is the most important piece of information about your job. When you need to contact the BMRC team for help, please remember to tell us what your job id was. This will make it considerably easier for us to help you.

Log Files

When you run a cluster job, the scheduler automatically produces two files, which it names in the format <jobname>.o.<jobid> and <jobname>.e.<jobid>. The first is the output file and the second is the error file.

If your job (“simple.sh”) has already finished, you will be able to see these files in your current directory by running:

$ ls
simple.sh.o.4005 simple.sh.e.4005

NB If running the ls command does not show these files, wait a few more seconds and try again.

 When talking about these output files with other human beings, it's common to refer to them simply as the .o (dot-oh) and .e (dot-eee) files. The significance of the .o and .e files derives from the Linux convention that running a command or a script in the terminal produces two output streams, called the Standard Output (stdout) and Standard Error (stderr). (It can of course also send output directly to files on disk). Commands or scripts send their normal output to stdout while error messages go to stderr. When you run a command directly at the terminal, both streams are normally sent back to the terminal so you see them on screen. However, when running as a cluster job both streams are redirected to a file as happens here.

To see the contents of the log file, you can use the cat command. (For longer log files, the less command is probably more useful). Looking at the .o file should show similar output to what we saw when running sum.sh directly earlier.

qsub parameters and environment

As we have seen above, the scheduler software can be used very simply. However, it is also highly adaptable by passing certain configuration parameters. We will look at some of those parameters in more detail below. For now, we will focus on one important parameter - the environment.

When working in the shell in Linux, your commands are run in a particular context called the environment, which contains certain variables and their values. Your environment contains, for example, the PWD variable which stores which directly you are currently working in. You can see this by running echo $PWD. When you run a command directly at the shell, your command inherits your current environment; however, submitting jobs to the scheduler is different. These jobs will not run in your current shell and, indeed, they will not even run on the same computer you are currently logged into. What value, then, should the scheduler use for your current working directory (i.e. the PWD variable)?

To solve the problem with your current working directory, the scheduler works with a simple policy: unless you explicitly state what you want to the working directory to be, the scheduler will assume you want it to be your home directory. That is why, when you submitted your first job (“simple.sh”) above, your log files (.o and .e files) appeared there. It is important to understand that, although you were working in your home directory, that is not why the scheduler wrote your log files there. Instead, the scheduler wrote the log files to your home directory because you did not specify a different working directory, and when that is the case the scheduler assumes you want to work in your home directory.

Changing the behaviour of the qsub command can be achieved in three different ways:

  1. Extra parameters can be added directly on the command line. In the example above, you already used this method to specify which project your job should be associated with by adding -P project.prjc. Similarly, to tell the scheduler to use the current working directory (where you are now) as the current working directory when your script is run, you can add the -cwd parameter as follows:
    $ qsub -P project.prjc -cwd $SGE_ROOT/examples/jobs/simple.sh
    
     Using -cwd on its own as we do here means use the current working directory at the time of submitting to qsub as the current working directory for the script. It is also possible to specify a fixed directory using e.g. -wd /path/to/my/files (note that the parameter name here is -wd for working directory.
  2. Extra parameters can be added to your script files using the special script syntax #$. For example, instead of adding -P or -cwd to the command line you could add them to your bash script file like this:
    ...
    #$ -P project.prjc
    #$ -cwd
    ...
    
  3. You can also set parameters in a configuration file at $HOME/.sge_request. This is probably best used for configuration settings that you won’t change very often. For example. if you create the file $HOME/.sge_request and add these lines (similar to option 2)…
    #$ -P project.prjc
    #$ -cwd
    

    … then, these settings will be applied to every job you submit, unless you override them via methods 1 or 2. (Parameters in the script file override parameters on the command line which in turn override parameters in your local configuration file).

IMPORTANT - Avoiding inheriting environments with the -v parameter

Grid Engine permits the usage of the -V parameter which instructs the scheduler to make a copy of the user's shell environment, including all environment variables, at the time of submission and to add these to the environment when your scheduled job is run on a cluster node. We strongly recommend NOT using the -V parameter. Instead, we recommend putting all the code needed to setup your desired environment into your bash submission script.

The -V parameter interferes with the automatically selection of software and modules as discussed in our module guide.

 

Writing your own bash script file

Now that we have practised with the example file at $SGE_ROOT/examples/jobs/simple.sh, let’s try writing our own bash script file.

First of all, make a subdirectory in your home folder called qsub-test and change directory into it:

$ mkdir qsub-test
$ cd qsub-test

Now copy the contents below into a file named python-hello.sh in your $HOME/qsub-test folder.

#!/bin/bash
#$ -wd $HOME/qsub-test
#$ -P project.prjc
#$ -N my-python-job
#$ -q short.qc

echo "------------------------------------------------"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"

module load Python/3.7.2-GCCcore-8.2.0

sleep 60s
python -c 'print("Hello")'

This script works as follows.

  • The first line $!/bin/bash is a special command which says that this script should use the bash shell. You will normally want to include this in every bash script file you write.
  • The next three lines use the special #$ comment syntax to add some parameters to our qsub command. In particular:
    • we specify our project name using -P (make sure to substitute your own project name).
    • We also specify -wd $HOME/qsub-test so that the script will be run with $HOME/qsub-test as its current directory.
    • We use -N to specify a name for this job. By default, jobs are named after their script files, but you can specify any name that would be helpful to you.
    • Finally, we use -q short.qc to specify that the job short run on that particular queue.
  • In the next lines (all beginning echo...) we print some useful debugging information. This is included here just for information.
  • The line module load Python/3.7.2-GCCcore-8.2.0 uses the module system to load a version of Python.
  • The line sleep 60s causes our script to pause for 60 seconds.
  • The final line uses python to print the message ‘Hello’

Now, for sake of this example, we will move back to our home directory and submit the script file from there:

$ cd $HOME
$ qsub qsub-test/python-hello.sh
Your job 4006 ("my-python-job") has been submitted.

Now that your job has been submitted, if you are quick enough (within 60 seconds!) you can use the qstat command to check the status of your job. The output will look something like the below.

$ qstat
job-ID  prior   name            user          state  submit/start at      queue    jclass    slots ja-job-ID
------------------------------------------------------------------------------------------------------------
4006    0.000   my-python-job   <username>    qw     01/09/2020 21:14:18                         1
 You can see further details of the qstat command below. For now, simply note that without arguments, qstat will show you the status of all your queued or running jobs. If you have several queued or running jobs and just want to see the status of one of them, you can use qstat -j <job_id>

Because we told the scheduler to use the folder $HOME/qsub-test as the working directory for this script, that is where it will put our log file (even though we submitted the script from our $HOME directory). Now let’s look at the log files:

$ cd qsub-test
$ ls
my-python-job.o.4006 my-python-job.e.4006
$ cat my-python-job.o.4006
------------------------------------------------
Run on host: compf003.hpc.in.bmrc.ox.ac.uk
Operating system: Linux
Username: <your-username>
Started at: Thu Jan 9 16:37:58 GMT 2020
------------------------------------------------
Hello

The output shows that this script ran on the node compf003.hpc.in.bmrc.ox.ac.uk, however, you will likely see a different node. It also shows that the cluster node’s operating system was Linux, that the job was run as your username, and that the job started at the time shown.

Since this job ran successfully, the .e file will be empty, which is a reassuring sign! You can check that it’s empty, if you like, using the cat command.

 Note that we explicitly specified a queue for this job to be submitted to. If you don't specify a queue, jobs will run by default on test.qc queue which has a maximum run time of 1 minute. The test.qc queue is good for testing purposes but it's not the queue you want to use normally.

Submitting to Specific Host Groups or Nodes

Sometimes, you may wish to target which cluster nodes your job will run more finely than by specifying just a queue. In this case, it is possible to add further information to your queue specification.

  • Specifying e.g. qsub -q short.qc@@short.hgc will run on the short queue AND only on C nodes, while qsub -q short.qc@@short.hge will run on the short queue AND only on the E nodes. The normal list of “hostgroups” that you can use for these purposes are:
    • short.hgc, short.hgd, short.hge, short.hgf
    • long.hgc, long.hgd, long.hge, long.hgf

It is rarely desirable to limit running your job to a particular node, but in cases where this is desirable, you can use the ‘@@’ specification similarly to specify a particular target node.

  • Specifying e.g. qsub -q short.qc@@compd024 will run in short.qc AND only on node compd024.

Command Line Options and Script Parameters

The qsub command can be used to run jobs very simply. However, it also accepts a wide range of configuration settings which we can add to the qsub command. For example, the option -q specifies which queue to submit to. For example, we could resubmit sum.sh to a different queue as follows:

$ qsub -P group.prjc -q short.qe sum.sh
 Although we are resubmitting the same script sum.sh to the cluster, this still represents a new job so the scheduler will issue a new job id.

Typing all the desired options for job at the command line is certainly possible, but it also has disadvantages: not only does it require more typing, it makes it more difficult to record which options were used at a particular time. So, instead of having to type these options in every time, it is also possible to include them directly in our script file.

Using your favourite editor, let’s update the sum.sh file as follows:

#!/bin/bash

#$ -P group.prjc
#$ -q short.qc

echo "------------------------------------------------"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"

echo "1+1=$((1+1))"

The two additional lines use a special comment syntax, where a line begins #$ followed by a space, followed by the same parameter that you would use at the command line. Now that we have specified our group and project inside our script, we can run this job more simply than before like this:

% qsub sum.sh

Instead of writing your own job submission script from scratch yo can also use a copy of our job submission template file, located at /apps/scripts/sge.template.sh which contains hints on a number of commonly used parameters. To use this file, begin my making a copy into your own home directory by running:

cp /apps/scripts/sge.template.sh ~/myjobscript.sge.sh

Then you can edit the file for your own purposes.

Checking or Deleting Running Jobs

A busy computing cluster handles many thousands of computing jobs every day. For this reason, when you submit a computing job to the cluster your job may not run immediately. Instead, it will be held in its queue until the scheduler decides to send it to a cluster node for execution. This is one important difference to what happens when you run a job on your own computer.

You can check the status of your job in the queue at any time using the qstat command. Running qstat on its own will report the status of every job of yours which remains in the queue. Alternatively, use qstat -j <jobid> to report the status of of an individual job. The output of these commands will look like this:

job-ID     prior   name       user          state submit/start at        queue    jclass    slots ja-job-ID
-----------------------------------------------------------------------------------------------------------
22355866   0.000   sum.sh     <username>    qw    01/09/2020 21:14:18                         1

qstat reports lots of useful information such as job id, submission time (or, if the job has started, its start time), and the number of reserved slots. However, perhaps the most important piece of information is in the state column. For example, qw means that the job is held in the queue and waiting to start. Alternatively, a state of r means tha the job is currently running.

One important thing to note is that the qstat command reports only jobs which are either waiting to run or currently running. Once a job has finished qstat will no longer report it. Instead, once a job has submitted you can see a detailed audit, including total run time and memory usage, by running qacct -j <jobid>.

Occasionally, you may wish to delete a job before it start or before it completes. For this you can use the qdel -j <jobid> command.

 A template bash script is available at /apps/scripts/sge.template.sh which you can copy and modify for your own needs.

Preparing a bash script file

We’ll start by using the bash script template located at /apps/scripts/sge.template.sh. You should take a copy of this file to use as a starting point for your own scripts. The file will begin with the following content:

#!/bin/bash

# Specify a job name
#$ -N jobname

# --- Parameters for the Queue Master ---
# Project name and target queue
#$ -P group.prjc
#$ -q test.qc

# Run the job in the current working directory
#$ -cwd -j y

# Log locations which are relative to the current
# working directory of the submission
###$ -o output.log
###$ -e error.log

# Parallel environemnt settings
#  For more information on these please see the wiki
#  Allowed settings:
#   shmem
#   mpi
#   node_mpi
#   ramdisk
#$ -pe shmem 1

# Print some useful data about the job to help with debugging
echo "------------------------------------------------"
echo "SGE Job ID: $JOB_ID"
echo "SGE Job ID: $SGE_JOB_ID"
echo "Run on host: "`hostname`
echo "Operating system: "`uname -s`
echo "Username: "`whoami`
echo "Started at: "`date`
echo "------------------------------------------------"

# Finally, we can run our real computing job

echo $JOB_ID

# End of job script

The first line #!/bin/bash specifies that this is a bash script (this will always be the same).

Any line beginning with # is a bash comment i.e. it will be ignored by the bash interpreter when the script file is actually run. Comments are most often there to make the code more human readable, i.e. to explain what the code does to any human readers. However, our template file also contains a number of special comments which begin #$. These special comments which provide configuration settings and will be read and understood by the scheduler.

For example, it is often helpful to specify a name for your job - you can do this by changing the line #$ -N jobname. Some of the configuration settings in the template file are optional but others are required. For example, submitting a job to the cluster requires that you specify a project to associate the job to. This is used for accounting and also defines which of your affiliations the work is attributed to. Each group that can use the cluster will be given a project, often associated with the PI of the group.

 It also also possible to pass configuration settings to the queue master when submitting a job on the command line. However, including them in your bash script has the benefit of saving typing if you ever want to run the same or a similar job again and helps to keep a record of which settings were used.

The penultimate section of the bash script template uses the echo command to output some information which can be useful in debugging.

Finally, the script runs the actual computing job. In this case, the job is a trivial one of printing out the job id, which is achieved by the line echo $JOB_ID. However, it is possible to achieve very complex things your bash script file including loading specific software modules, looping over a set of data or running a whole pipeline in sequence.

 

commonly used qsub Parameters

Run man qsub to see the full list of qsub options. However, here are some commonly used options as a guide:

-P <project>.prjc Specify which project your job belongs to (required)
-q <queuename> Specify which queue to run a job in. See the full list of cluster queues
-j y Instead of separate output and error log files, send the error output to the regular output log file
-o <dir>, -e <dir> Put the output log file (-o) and/or error log file (-e) into the specified directories keeping their default file names. By default, they go into the current working directory.
-o <path/to/file>, -e <path/to/file> Put the output log file (-o) and/or error log file (-e) into the specified files.
-cwd Use the current working directory at the time of job submission (i.e. your location when you run qsub) as the current working directory when executing your script
-wd </path/to/dir> Use /path/to/dir as the working directory when executing your script
-pe shmem N Request N slots. See cluster queues reference for full details of our nodes. NB The higher the number of slots requested the longer it will normally take for your job to be scheduled. A practical maximum of 12 slots is possible on short and long queues. The theoretical maximum is 16 slots on C nodes, 24 slots on D and E nodes, 12 slots on F nodes and 46 slots on H nodes, but you are advised to request less than the maximum theoretical available in order for your jobs to be scheduled within a reasonable time.
-pe ramdisk <config> Use a ramdisk - see the ramdisk guide.
-b y <executable> Indicate that your job is not a shell script but a binary that should be run directly
-S </path/to/shell> By default, our cluster uses the Bash shell. Use -S if you need your job submission script to be interpreted by another shell.

 

Array Jobs

If you have a set of jobs that perform the same action on many different inputs then you may be better submitting an SGE array job. An array job:

  • is made up of a number of tasks, with each task having a specific task id
  • has a single submission script that gets executed for each task in the array
  • has a single job id

The tasks get scheduled separately such that many tasks can be executing at the same time. This means tasks should not depend upon each other as there is no guarantee any one task will have completed before another starts.

The task id, along with other relevant information, is passed to the job through some environment variables. The task id is a number. It may be possible to use this number directly in your script or it may be necessary to use the task id to identify the specific inputs for the task.

There are a number of benefits to choosing to use array jobs rather than submitting lots of nearly identical jobs. As the entire array is just a single job, it is easier for you to keep track of and manage your jobs. Also, array jobs put much less load on the job scheduling system than the equivalent number of single jobs.

Declaring an array job

Array jobs are declared using the ‘-t’ argument to qsub:

-t n[-m[:s]]
: n is the first task id;
m is the last task id;
s is the task id step size (default is 1);
If only n is specified, it is treated as ‘-t 1-n:1’.

For example, the task id range specified by 2-10:2 would result in the task id indexes 2, 4, 6, 8, and 10, for a total of 5 otherwise identical tasks.

The following restrictions apply to the values n and m:

1 <= n <= MIN(2^31-1, max_aj_tasks)
1 <= m <= MIN(2^31-1, max_aj_tasks)
n <= m

where max_aj_tasks is defined in the cluster configuration (see qconf -sconf, 75000 at the time of writing). If max_aj_tasks is insufficient then please contact bmrc-help@medsci.ox.ac.uk to ask for it to be raised.

The number of tasks eligible for concurrent execution can be controlled by the ‘-tc’ argument to qsub:

-tc max_running_tasks

max_running_tasks is the number of tasks in this array job that are allowed to be executing at the same time. For example, -t 1-100 -tc 2 would submit a 100-task array job where only 2 tasks can be executing at the same time. Note that this is the maximum allowed to run at the same time and does not meant that tasks will be run in pairs.

Environment variables

SGE sets the following task-related environment variables:
SGE_TASK_ID - the task id of this task
SGE_TASK_FIRST - the task id of the first task in the array (parameter ‘n’ given to the ‘-t’ argument)
SGE_TASK_LAST - the task id of the last task in the array (parameter ‘m’ given to the ‘-t’ argument)
SGE_TASK_STEPSIZE - the size of the step in task id from one task to the next (parameter ‘s’ given to the ‘-t’ argument)

Step size and batching

There is a significant dead time between one job or task finishing and the next starting - sometimes as much as 15s. This time is spent reporting the job/task’s exit status and resource usage, tidying up after the job/task, waiting for the next run of the scheduler to select the next job/task and preparing for that job/task to run. If the action performed by the script takes more than a couple of hours to run, this dead time is fairly insignificant. If the action takes one second to run then running it once per job/task means that the dead time is 15 times longer than the run time and it will take very much longer than expected to finish the job. The best approach for such short job/task run times is to batch multiple runs together.

This can be done by using an array job with a step size > 1 and a loop within the script that loops over all the task ids in the step. The step size should be chosen such that the combined run time for all the tasks in the step is long enough to make the dead time insignificant without putting the task at risk of being killed for exceeding the queue’s run time limit. It is also polite to not hog the execution slot, so 2-4 hours is a good target. With a little bit of arithmetic it is possible to write the script in such a way that all that needs adjusting is the step size and the script takes care of the rest. See ‘Example using step size’ below for an example script.

Simple example

#!/bin/bash

# This script sets up a task array with a step size of one.

#$ -N TestSimpleArrayJob
#$ -q test.qc
#$ -t 1-527:1
#$ -r y

echo `date`: Executing task ${SGE_TASK_ID} of job ${JOB_ID} on `hostname` as user ${USER}
echo SGE_TASK_FIRST=${SGE_TASK_FIRST}, SGE_TASK_LAST=${SGE_TASK_LAST}, SGE_TASK_STEPSIZE=${SGE_TASK_STEPSIZE}

##########################################################################################
#
# Do any one-off set up here
#
## For example, set up the environment for R/3.1.3-openblas-0.2.14-omp-gcc4.7.2:
## . /etc/profile.d/modules.sh
## module load R/3.1.3-openblas-0.2.14-omp-gcc4.7.2
## which Rscript
#
##########################################################################################

##########################################################################################
#
# Do your per-task processing here
#
## For example, run an R script that uses the task id directly:
## Rscript /path/to/my/rscript.R ${SGE_TASK_ID}
## rv=$?
#
##########################################################################################

echo `date`: task complete
exit $rv

Example using step size

#!/bin/bash

# This script sets up a task array with one task per operation and uses the step size
# to control how many operations are performed per script run, e.g. to manage the
# turnover time of the tasks. This also makes it a bit easier to re-run a specific
# task than using a step size of one and an unrelated loop counter inside the script

#$ -N TestArrayJobWithStep
#$ -q test.qc
#$ -t 1-527:60
#$ -r y

echo `date`: Executing task ${SGE_TASK_ID} of job ${JOB_ID} on `hostname` as user ${USER}
echo SGE_TASK_FIRST=${SGE_TASK_FIRST}, SGE_TASK_LAST=${SGE_TASK_LAST}, SGE_TASK_STEPSIZE=${SGE_TASK_STEPSIZE}

##########################################################################################
#
# Do any one-off set up here
#
## For example, set up the environment for R/3.1.3-openblas-0.2.14-omp-gcc4.7.2:
## . /etc/profile.d/modules.sh
## module load R/3.1.3-openblas-0.2.14-omp-gcc4.7.2
## which Rscript
#
##########################################################################################

# Calculate the last task id for this step
this_step_last=$(( SGE_TASK_ID + SGE_TASK_STEPSIZE - 1 ))
if [ "${SGE_TASK_LAST}" -lt "${this_step_last}" ]
then
    this_step_last="${SGE_TASK_LAST}"
fi

# Loop over task ids in this step
while [ "${SGE_TASK_ID}" -le "${this_step_last}" ]
do
    echo `date`: starting work on SGE_TASK_ID=`printenv SGE_TASK_ID`

##########################################################################################
#
#   Do your per-task processing here
#
##  For example, run an R script that uses the task id directly:
##  Rscript /path/to/my/rscript.R ${SGE_TASK_ID}
##  rv=$?
#
##########################################################################################

    # Increment SGE_TASK_ID
    export SGE_TASK_ID=$(( SGE_TASK_ID + 1 ))
done

echo `date`: task complete
exit $rv

Mapping task-specific arguments from task ids The examples above just use the task id directly but this isn’t always possible. There are a number of ways that one might be able to map the task id to the task-specific arguments but if your requirements are complicated then you may have to write some code to do it. However, there are some simple options.

If you have a file that lists the task-specific arguments, with the arguments for the nth task on the nth line:

cat /path/to/my/argument/list | tail -n+${SGE_TASK_ID} | head -1 | xargs /path/to/my/command

Examples where this might be useful include processing a list of genes or snps.

Ramdisks

In some applications a dependency on a high-speed local file system is required, and for a lot of the nodes in the cluster the default ${TMPDIR} is not very performant. To alleviate these issues you can allocate some of the memory from your job as a disk capable of handling temporary files. This has limitations, as you need a clear idea on what the upper limits will be required to both satisfy the memory for the job and the temp space required.

Tutorial

Instead of specifying the ‘shmem’ parallel environment variable to specify your slots, you ‘'’instead’’’ will use ‘ramdisk’ to specify your slots required:

-pe ramdisk <number_of_slots>

You will ‘'’also need to set a hard vmem limit’’’. This will set the limit of how much memory you want to be available to the application. This must not exceed the total memory available for the given slots on your selected node:

-l h_vmem=<memory_limit>

A ramdisk will be created at ${TMPDIR}/ramdisk of size equal to the difference of the total memory available using the selected slots minus the hard limit of memory assigned for the job. The value of ${TMPDIR} is that given to your script by SGE. If you change ${TMPDIR} in your script then make sure your script makes a note of where the ramdisk is mounted before the change is made.

E.g. for a job that needed 20GB memory and some ramdisk, the arguments:

-P group.prjc -q short.qc -l h_vmem=20G -pe ramdisk 2 ./jobscript.sh

Given the default in that queue of 16G per slot, these flags would limit your job to 20G of memory but also create a 12G ramdisk using the remaining available memory provided by 2 slots.

Job Dependencies

Once your cluster workflows reach a certain level of complexity, it becomes natural to want to order them into stages and to require that later jobs commence only after earlier jobs have completed. The scheduler includes dedicated parameters to assist with creating pipelines to meet this requirement.

To ensure that job B will commence only after job A has finished, you first submit job A and make a note of either it's job id number or its name (this is especially useful if you have assigned a name using -N). Now you can submit job B and specify that it should be held until job A finishes:

qsub -hold_jid <jobA_id_or_name> ...

If wishing to make such hold requests in a shell script, it may help to submit job A using the -terse parameter so that only its job id number is returned. You can then collect this id number in a shell variable and use it when submitting job B like so:

JOBA_ID=$(qsub -t -q short.qc jobA.sh)

qsub -q short.qc -hold_jid $JOBA_ID jobB.sh

 

It is also possible to use an advanced form of job holding when you have sequential array jobs and you wish the the tasks in the second job to start as soon as the correspnding task in the first job has completed. For example, if jobA is an array job with tasks numbered 1-100 and jobB is another array job also with tasks number 1-100 then one can arrange for job B's subtasks to run as soon as the corresponding task in job A has completed by running:

qsub -q short.qc -t 1-100 -N jobA jobA.sh

qsub -q short.qc -hold_jid_ad jobA jobB.sh

Using this method, which demonstrates the using holding a job by name rather than by job id, job B's tasks will be allowed to run (subject to availability on the cluster) as soon as the corresponding task in jobA has completed. Note that since job A's tasks may have different run-time durations and so may complete in any order, job B's tasks may commence in any order.

Troubleshooting

Cluster jobs can go wrong for many different reasons. If one of your computing jobs hasn’t behaved as expected, here are some suggestions on how to diagnose the issue. To begin, we focus on jobs which have already run to completion.

Check the Job Report using qacct

Login to the cluster and run: qacct -j <job-id>. This will print a summary report of how the job ran which looks as follows:

# qacct -j 12345678
==============================================================

qname         short.q
hostname      <hostname>
group         users
owner         <user>
project       NONE
department    defaultdepartment
jobname       <jobname>
jobnumber     12345678
taskid        undefined
account       sge
priority      0
qsub_time     Fri Jan 24 195835 2020
start_time    Fri Jan 24 195842 2020
end_time      Fri Jan 24 195943 2020
granted_pe    NONE
slots         1
failed        0
exit_status   0
ru_wallclock  61
ru_utime      0.070
ru_stime      0.050
ru_maxrss     1220
ru_ixrss      0
ru_ismrss     0
ru_idrss      0
ru_isrss      0
ru_minflt     2916
ru_majflt     0
ru_nswap       0
ru_inblock    0
ru_oublock    176
ru_msgsnd     0
ru_msgrcv     0
ru_nsignals   0
ru_nvcsw      91
ru_nivcsw     8
cpu           0.120
mem           0.001
io            0.000
iow           0.000
maxvmem       23.508M
arid          undefined

Now run through the following checks:

  • Check the Failed status. If Failed is 0 then your job ran successfully to completion (from a code execution point of view). However, if Failed is 1 then check the Exit Status.
  • The Exit Status reported by qacct -j will be a number from 0 to 255. If the Exit Status > 128 (i.e. Exit Status is 128+n) then the computing job was killed by a specific shell signal whose id number is n. You can find which signal corresponds to n by running man 7 signal e.g. on rescomp1.well.ox.ac.uk. For example, Exit Status 137 is 128+9 where 9 is the signal for SIGKILL, which means that a job was automatically terminated, normally because it exceeded either its duration or its maximum RAM.
  • Check maxvmem. This figure should not exceed the RAM limit of the queue (see table).

If qacct -j reports that your job ran to completion (i.e. if Failed is 0) then the next step is to examine the .o and .e log files.