GPU Resources

The BMRC cluster includes a number of NVIDIA GPU-accelerated servers to support AI/ML, image processing and other GPU-accelerated workflows.

If you have any questions or comments about using the GPU resources on BMRC, please contact us (bmrc-help@medsci.ox.ac.uk).

 

RECENT CHANGES

Interactive sessions are available through Slurm using the gpu_interactive partition.

compg039-compg041 have been brought into service.

 

VARIETIES OF GPU NODE

In our regular (i.e. non-GPU) cluster, there are groups of nodes (e.g. compa, compe, compf) where the hardware varies between groups but is identical within each group. The situation is different for the compg GPU nodes. Because GPU hardware capabilities change rapidly, there is considerable variation between the GPU nodes: they offer different combinations of CPU and RAM as well as different numbers and types of GPU card. Furthermore, each machine is configured to host only as many scheduler slots as it has GPU cards, on the assumption that every job will need at least one GPU card. As a consequence, the available RAM per slot on the GPU partitions varies widely, from a minimum of 60.8 GB up to 750 GB.

 

Because of the variation in CPU, RAM, GPU card type and number of GPUs available per node, you may need to plan your job submissions carefully. The sections below provide full information on the nodes available in order to assist with your planning.

 

SCHEDULED GPU CLUSTER NODES

There are two Slurm partitions for GPU resources, gpu_short and gpu_long.

Jobs run on gpu_short have a maximum job duration of 4 hours.

Jobs run on gpu_long have a maximum job duration of 60 hours.

gpu_long is only available on a subset of nodes, so it is recommended that you submit jobs to gpu_short when you can.
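To check which nodes are currently in these partitions, and which GPUs they provide, you can query the scheduler from a login node. A short sketch using standard Slurm commands:

sinfo -p gpu_short,gpu_long                 # node states in the GPU partitions
sinfo -p gpu_short,gpu_long -o "%P %N %G"   # also show the GRES (GPU) configuration per node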

 

There is a partition for interactive jobs:

gpu_interactive

 

There are three Slurm partitions for specific resources/workflows; please contact us if you need access to these:

gpu_relion

gpu_long_palamara

gpu_long_zhang

 

Jobs are submitted to gpu_short (or gpu_long) using sbatch in a similar way to submitting a non-GPU job; however, you must supply some extra parameters to indicate your GPU requirements, as follows:

sbatch -p gpu_short --gres gpu:<N> <JOBSCRIPT>

where <N> is the number of GPUs required for each job.

 

Alternatively, you can use:

sbatch -p gpu_short --gpus-per-node <N> <JOBSCRIPT>

 

The recommended way to request GPUs for jobs on the BMRC Slurm GPU queues is to use --gres or --gpus-per-node.
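As an illustration, here is a minimal sketch of a GPU job script; the job name is hypothetical and nvidia-smi simply stands in for your own application. With the partition and GPU request embedded as #SBATCH directives, the script can be submitted with sbatch <JOBSCRIPT> and no extra command-line options.

#!/bin/bash
#SBATCH -J gpu-example        # job name (hypothetical)
#SBATCH -p gpu_short          # GPU partition
#SBATCH --gres gpu:1          # request one GPU
nvidia-smi                    # replace with your GPU application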

 

There are other options in Slurm for requesting GPUs, including --gpus, --gpus-per-task and --gpus-per-socket. These are relevant for MPI workloads and can lead to blocking reservations, so please contact BMRC before using them.

 

Optionally, you can specify a particular type of GPU to run on, e.g.:

sbatch -p gpu_short --gres gpu:a100-pcie-40gb:1 <JOBSCRIPT>

 

You can use Slurm Features/Constraints to specify the class(es) of GPU that you wish your job to run on. The features are listed in the 'Slurm Features' column in the table below. For example, to run on P100 and A100 nodes only:

sbatch -p gpu_short --gpus-per-node 1 --constraint "p100|a100" <JOBSCRIPT>

 

The default number of CPU cores per GPU is 6. You can request more (or fewer) CPU cores for your job with --cpus-per-gpu <N>. Alternatively, you can set the total number of cores required for the job with -c <N>, where <N> is the number of cores.
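For example (the core counts here are purely illustrative):

sbatch -p gpu_short --gres gpu:1 --cpus-per-gpu 12 <JOBSCRIPT>   # 12 cores per GPU
sbatch -p gpu_short --gres gpu:1 -c 12 <JOBSCRIPT>               # 12 cores in total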

 

The default system memory available per GPU is 60.8 GB. You can request more (or less) system memory for your job with --mem-per-gpu <M>G. Alternatively, you can specify the total memory requirement for your job with --mem <M>G, where <M> is the number of GB of memory required.
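For example (illustrative values; check the table below for what each node can actually provide):

sbatch -p gpu_short --gres gpu:1 --mem-per-gpu 100G <JOBSCRIPT>   # 100 GB per GPU
sbatch -p gpu_short --gres gpu:1 --mem 200G <JOBSCRIPT>           # 200 GB for the whole job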

 

 

Node     | GPU Type       | Slurm Features | Num GPU Cards | GPU RAM per card (GB) | CPU Cores per GPU | RAM per GPU (GB) | CPU Compatibility
compg009 | p100-sxm2-16gb | p100, flash    | 4             | 16                    | 6                 | 91.2             | Skylake
compg010 | p100-sxm2-16gb | p100, flash    | 4             | 16                    | 6                 | 91.2             | Skylake
compg011 | p100-sxm2-16gb | p100, flash    | 4             | 16                    | 6                 | 91.2             | Skylake
compg013 | p100-sxm2-16gb | p100           | 4             | 16                    | 6                 | 91.2             | Skylake
compg016 | v100-pcie-32gb | v100, flash    | 2             | 32                    | 6                 | 750              | Skylake
compg026 | p100-pcie-16gb | p100, flash    | 4             | 16                    | 10                | 91.2             | Skylake
compg027 | v100-pcie-16gb | v100           | 4             | 16                    | 12                | 60.8             | Skylake
compg028 | quadro-rtx8000 | rtx8000, flash | 4             | 48                    | 8                 | 187.2            | Cascadelake
compg029 | quadro-rtx8000 | rtx8000, flash | 4             | 48                    | 8                 | 187.2            | Cascadelake
compg030 | quadro-rtx8000 | rtx8000, flash | 4             | 48                    | 8                 | 187.2            | Cascadelake
compg031 | a100-pcie-40gb | a100, flash    | 4             | 40                    | 8                 | 91.2             | Cascadelake
compg032 | a100-pcie-40gb | a100, flash    | 4             | 40                    | 8                 | 91.2             | Cascadelake
compg033 | a100-pcie-40gb | a100, flash    | 4             | 40                    | 8                 | 91.2             | Cascadelake
compg034 | a100-pcie-40gb | a100, flash    | 4             | 40                    | 8                 | 91.2             | Cascadelake
compg035 | a100-pcie-80gb | a100, flash    | 4             | 80                    | 8                 | 91.2             | Icelake
compg036 | a100-pcie-80gb | a100, flash    | 4             | 80                    | 8                 | 91.2             | Icelake
compg037 | a100-pcie-80gb | a100, flash    | 2             | 80                    | 24                | 256              | Icelake
compg038 | a100-pcie-80gb | a100, flash    | 2             | 80                    | 24                | 256              | Icelake
compg039 | a100-pcie-80gb | a100, flash    | 4             | 80                    | 12                | 128              | Icelake
compg040 | a100-pcie-80gb | a100, flash    | 4             | 80                    | 12                | 128              | Icelake
compg041 | a100-pcie-80gb | a100, flash    |               | 80                    | 12                | 128              | Icelake

 

INTERACTIVE GPU SESSIONS

You can get an interactive session using the gpu_interactive partition. The partition has a 12-hour runtime limit. Jobs submitted to this partition run on compg009 and compg010.

 

srun -p gpu_interactive --gres gpu:1 --pty bash

 

LEGACY DEDICATED NODES

We maintain a number of GPU nodes which are dedicated to specific legacy projects. Please email us with any questions regarding these dedicated nodes.

 

Node     | GPU Type       | Num GPU Cards | GPU RAM per card (GB) | CPU Cores | Total RAM (GB) | CPU Compatibility
compg012 | GTX 2080 Ti    | 4             | 11                    | 20        | 256            | Skylake
compg014 | GTX 2080 Ti    | 4             | 11                    | 24        | 384            | Skylake
compg015 | GTX 2080 Ti    | 4             | 11                    | 24        | 384            | Skylake
compg017 | v100-pcie-32gb | 2             | 32                    | 24        | 1500           | Skylake
compg018 | quadro-rtx6000 | 4             | 24                    | 32        | 384            | Skylake
compg019 | quadro-rtx6000 | 4             | 24                    | 32        | 384            | Skylake
compg020 | quadro-rtx6000 | 4             | 24                    | 32        | 384            | Skylake
compg021 | quadro-rtx6000 | 4             | 24                    | 32        | 384            | Skylake
compg022 | v100-pcie-16gb | 4             | 16                    | 32        | 384            | Skylake
compg023 | quadro-rtx6000 | 4             | 24                    | 32        | 384            | Skylake
compg024 | quadro-rtx6000 | 4             | 24                    | 32        | 384            | Skylake
compg025 | quadro-rtx8000 | 4             | 48                    | 32        | 384            | Skylake

 

FAST LOCAL SCRATCH SPACE

A number of nodes have fast local NVMe drives for jobs that require a lot of I/O. This space can be accessed from:

/flash/scratch

or from project-specific folders in /flash on the nodes.

In Slurm, you can select nodes with a local scratch folder using the flash constraint:

sbatch -p gpu_short --gpus-per-node 1 --constraint "flash" <JOBSCRIPT>

This folder is open to all jobs, so care should be taken to protect your data by placing it in subfolders with the correct permissions.
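For example, a minimal sketch creating a private subfolder (the folder name is illustrative):

mkdir -p /flash/scratch/$USER    # create a personal subfolder under the shared scratch space
chmod 700 /flash/scratch/$USER   # restrict access to your own user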

As the space on these drives is limited, you should remove any data from the scratch space when your job is complete. Scheduled automatic deletion from /flash/scratch will be introduced.

 

MONITORING

 

In an interactive session you should use the nvidia-smi command to check what processes are running on the GPUs and top to check what is running on the CPUs.
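For example, inside an interactive session:

nvidia-smi      # processes, memory use and utilisation on the node's GPUs
top -u $USER    # restrict top to your own processes on the CPUs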

 

On the scheduled nodes, you can see the jobs running and waiting in the GPU queues by running the following from a login node:

squeue -p gpu_short,gpu_long

 

GPU SOFTWARE

The CUDA libraries are required to run applications on NVIDIA GPUs. More recent GPUs require later versions of the CUDA libraries; the CUDA page on Wikipedia has useful information about versions. Software packages typically need to be compiled for a particular version of CUDA.

Our pre-installed CUDA-related software is made available, in the same way as the majority of our pre-installed software, via software modules. Use module avail to see which software packages are available and module load <MODULE-NAME> to load your desired software modules.
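For example (the module name is a placeholder; run module avail to see the exact names installed on the cluster):

module avail                     # list the available software modules
module load <CUDA-MODULE-NAME>   # load a CUDA module (placeholder name)
nvcc --version                   # check the CUDA toolkit version, if the module provides the toolkit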

 

In addition to the main CUDA libraries themselves, we also provide a number of other widely used GPU software packages.

 

You can also install your own software via e.g. a Python virtualenv or conda.
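For example, a minimal sketch using a Python virtualenv (the environment path and package are illustrative; choose builds that match the CUDA versions available on the GPU nodes):

python3 -m venv ~/gpu-env       # create a virtual environment (path illustrative)
source ~/gpu-env/bin/activate   # activate it
pip install torch               # example GPU-enabled package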

 

RELION GPU

Information about running Relion on the Slurm GPU partitions will follow.