
How to access the BMRC GPU resources.

Before running jobs on the BMRC GPU cluster resources you should know:

1. Whether your Slurm project has GPU shares
2. Which GPUs you want your jobs to run on
3. How long your jobs will take

 

Jump to 'Submitting Jobs' for example Slurm commands.

 

Overview for PIs

Since 2011, BMRC has sold access to the cluster based on the idea of a “share”. In its simplest form, you could imagine that each group was buying a share of the cluster platform. If one group bought twice the share of another group then, at any time, the scheduler would aim to run twice as many jobs for the first group as for the second. However, if only one group was submitting jobs at a certain time, then all resources would be handed to that group: use it or lose it, credits could not be stored up. While the details of the implementation have changed and BMRC has become immensely more complex, the basic approach remains the same. You can find out more about the BMRC share philosophy and the fairshare calculation at: https://www.medsci.ox.ac.uk/for-staff/resources/bmrc/cluster-shares.

When there were only a few GPUs in the cluster, and only a few groups using them, we were able simply to extend the existing share model to cover them without too many issues. However, over the past few years GPU methods have become mainstream and GPUs have become incredibly expensive: BMRC currently invests much more in GPU-accelerated servers than in CPU-only servers, but even so it has many fewer GPUs than it has CPU cores. This has led to severe scheduling challenges with the current approach, and means that BMRC cannot recover enough through its charging to replace GPUs as they age without unfairly charging CPU users.

BMRC is now charging separately for CPU shares and GPU shares. To run CPU-only jobs you will need CPU shares and to run GPU jobs you will need GPU shares – to run both types of compute you will need both types of shares. This has meant that CPU shares are now significantly cheaper than they were previously. Note that the cost of the GPU share includes the costs of the CPUs and memory of the servers hosting the GPUs: you don’t need CPU shares to run a GPU job.

Having looked at GPU usage patterns we have decided we can offer entry-level continuous access to GPUs for small-to-moderate use at just over £1000 per project per year. Groups with heavier usage will simply need to buy multiple shares. Just as with CPU shares and to avoid overcommitting, there is a cap on the total number of GPU shares that will be sold for the GPU partitions. The cap is related to the total number of physical GPUs that we have available.

Extending the share approach to GPUs has been complicated, since there are many different kinds of GPU. To address this, we have defined a weighting for each type of GPU depending on the cost of providing them (buying, powering and administering them). When accounting is performed, the runtime of a job is multiplied by the relevant GPU weight, meaning that users can use more time on cheaper GPUs for the same cost to the project. This also means that only one type of GPU share is needed and it covers all our GPU types, from the RTX6000 to the H200, dramatically simplifying the accounting.

Shares

BMRC sells both CPU and GPU shares to projects to enable use of the Slurm cluster. CPU and GPU shares are separate: CPU shares do not give access to GPU-accelerated nodes and vice versa. If you wish to enquire about GPU shares please send a request to bmrc-help@medsci.ox.ac.uk.

A GPU share is an abstract quantity which affects the scheduling priority of the job. Once a job is running it gets all the resources that it has requested and it is not competing with other jobs. More shares mean higher priority for jobs in the queue. Each type of GPU has been given a weighting depending on the cost of providing them (buying, powering and administering them). For reference, an 80GB A100 GPU is defined to have a weighting of 1.00.

Based on an analysis of previous usage, we set the GPU share price at the cost of continuous usage of up to 1/3 of a GPU. If there is no GPU usage in a particular quarter then there will be no charge for the GPU shares for that quarter.

 

Selecting the GPU type

BMRC supports a number of GPU types. There is a partition for each GPU type. You must select the partitions corresponding to the GPUs that you wish to run your jobs on.

 

Important factors are:

1. Billing weight
2. GPU memory
3. Numerical precision

 

Each GPU type has a billing weight reflecting the cost of running the GPU nodes, and the scheduler factors this weight into its usage calculations. More expensive GPUs have a higher weight than cheaper GPUs, and if the weight is <1 then less usage is accounted for. This affects how jobs are prioritised in the scheduler: less accounted usage leads to a higher priority. Note that Slurm calls this a 'billing' weight; it is *not* a financial weighting on the cost of a GPU share.
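As a sketch of the accounting arithmetic: accounted usage is runtime multiplied by the partition's billing weight, so the same wall time on a cheaper GPU counts for less. The `weighted_hours` helper below is purely illustrative, not a BMRC tool; the weights are taken from the partition table below.

```shell
# Illustrative only: accounted usage = runtime (hours) x billing weight.
weighted_hours() {
    # $1 = runtime in hours, $2 = billing weight of the GPU type
    awk -v t="$1" -v w="$2" 'BEGIN { printf "%.1f\n", t * w }'
}

weighted_hours 10 1.00   # 10h on an A100 80GB -> 10.0 accounted hours
weighted_hours 10 0.66   # 10h on a P100 16GB  -> 6.6 accounted hours
```

Ten hours on a P100 is therefore accounted the same as 6.6 hours on an A100 80GB, which is why cheaper GPUs stretch a project's share further.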

 

To maximise throughput of your jobs you should have an estimate of the GPU memory required by your job and select partitions that satisfy the memory requirement. You can find information about GPU memory in the tables below.
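For example, a job that needs roughly 40 GB of GPU memory could be offered to every partition whose GPUs satisfy that requirement; Slurm will start it on whichever has a free GPU first. This is a sketch: the account name and job script are placeholders.

```shell
# Offer the job to all partitions with >= 40 GB GPU memory.
# gpu_<X>.prj and <JOBSCRIPT> are placeholders for your own values.
sbatch -A gpu_<X>.prj \
       -p gpu_a100_40gb,gpu_rtx8000_48gb,gpu_a100_80gb \
       --gres gpu:1 <JOBSCRIPT>
```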

 

All GPUs provide strong FP32 performance (single precision, the default), and most applications work at this precision. If your jobs require double precision (FP64), a lower precision (e.g. FP16, FP8), or more sophisticated cores for e.g. AI workloads, then you should carefully select the appropriate GPUs for your task (A100, V100, P100).

 

There is a partition (gpu_inter) for users who need shell access to a GPU node for development or analysis.

 

Partition Table

PARTITION NUM_GPU  GPU_MEMORY(GB) WEIGHT MAX_RUNTIME(Hours) NUM_CPU_DEFAULT MEM(GB)_DEFAULT
Batch Partitions
gpu_a100_80gb 24 80 1 60 11 120
gpu_rtx8000_48gb 12 48 0.72 60 7 185
gpu_a100_40gb 16 40 0.89 60 7 90
gpu_v100_32gb 2 32 0.89 60 7 750
gpu_p100_16gb 12 16 0.66 60 5 90
gpu_v100_16gb 4 16 0.7 60 11 60
Interactive Partitions
gpu_inter 18 24 0.59 12 7 80

 

Selecting runtime

The maximum runtime for most of the GPU partitions is 60 hours; some are shorter. If you know your jobs will finish sooner than 60 hours, you can apply a Slurm QOS (Quality of Service) to your jobs, which will significantly boost their priority in the queue and apply an appropriate runtime limit. The priority boost is largest for jobs that run under a 4-hour time limit, followed by the 24-hour QOS, with 60-hour jobs getting no priority boost at all.

 

GPU QOS Table

QOS Name Runtime (hrs) Priority Boost
Partition QOS
gpu_bmrc_partition_limits 60 0
gpu_bmrc_interactive_limits 12 0
User selectable QOS
gpu_bmrc_4hr 4 20000000
gpu_bmrc_24hr 24 10000000

Partition QOS are applied automatically when you select a partition for your job.
User-selectable QOS can be applied at job submission, e.g. --qos gpu_bmrc_4hr.

 

Note about limits

As GPUs are a limited resource under considerable demand, we apply usage limits to ensure throughput for jobs from all projects and to allow essential regular maintenance to be completed.

We cannot extend the 60hr runtime for jobs in normal operation.

Checkpointing and increasing parallelisation by breaking work into smaller chunks are two common ways to complete your workloads within shorter runtimes. They will also improve the resilience of your workload to interruption.
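The chunking approach can be sketched as a Slurm job array in which each task handles one independent chunk under a short QOS. The script name and chunk logic here are hypothetical; substitute your own workload.

```shell
#!/bin/bash
#SBATCH --account=gpu_<X>.prj      # placeholder project account
#SBATCH --partition=gpu_a100_40gb
#SBATCH --gres=gpu:1
#SBATCH --qos=gpu_bmrc_4hr         # each chunk fits the 4-hour limit
#SBATCH --array=0-9                # ten independent chunks

# Each array task processes one chunk; process_chunk.py is a placeholder
# for your own per-chunk workload.
python process_chunk.py --chunk "${SLURM_ARRAY_TASK_ID}"
```

An interrupted array loses only the tasks that were running, so this pattern also improves resilience.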

The per-group limit to the number of GPUs that can be in use is 24. This applies across all partitions.

At least 1 GPU must be selected (--gres gpu:1) for a job to run.

 

For all jobs, the GPU limits are: 24 GPUs per project, minimum 1 GPU per job

For jobs on the batch partitions: 60 hours max runtime

For sessions on the interactive partition: 1 GPU per user max, 12 hours max runtime

  

If you require GPU resources for jobs that must run longer than 60 hours, need direct instant access to a GPU, or want to maintain persistent sessions over a long time period, BMRC also provides GPU-accelerated VMs in the BMRC private cloud. If you would like to discuss access to the cloud resources please send a request to bmrc-help@medsci.ox.ac.uk.

 

Hardware

NODE GPU_TYPE SLURM_FEATURES NUM_GPU_CARDS GPU_RAM(GB)_PER_CARD CPU_CORES_PER_GPU RAM(GB)_PER_GPU CPU_COMPATIBILITY
compg009 p100-sxm2-16gb flash 4 16 6 91.2 Skylake
compg010 p100-sxm2-16gb flash 4 16 6 91.2 Skylake
compg011 p100-sxm2-16gb flash 4 16 6 91.2 Skylake
compg013 p100-sxm2-16gb - 4 16 6 91.2 Skylake
compg016 v100-pcie-32gb flash 2 32 6 750 Skylake
compg019 quadro-rtx6000 flash 4 24 8 91.2 Skylake
compg020 quadro-rtx6000 flash 4 24 8 91.2 Skylake
compg021 quadro-rtx6000 flash 4 24 8 91.2 Skylake
compg026 p100-pcie-16gb flash 4 16 10 91.2 Skylake
compg027 v100-pcie-16gb - 4 16 12 60.8 Skylake
compg028 quadro-rtx8000 flash 4 48 8 187.2 Cascadelake
compg029 quadro-rtx8000 flash 4 48 8 187.2 Cascadelake
compg030 quadro-rtx8000 flash 4 48 8 187.2 Cascadelake
compg031 a100-pcie-40gb flash 4 40 8 91.2 Cascadelake
compg032 a100-pcie-40gb flash 4 40 8 91.2 Cascadelake
compg033 a100-pcie-40gb flash 4 40 8 91.2 Cascadelake
compg034 a100-pcie-40gb flash 4 40 8 91.2 Cascadelake
compg035 a100-pcie-80gb flash 4 80 8 91.2 Icelake
compg036 a100-pcie-80gb flash 4 80 8 91.2 Icelake
compg037 a100-pcie-80gb flash 2 80 24 256 Icelake
compg038 a100-pcie-80gb flash 2 80 24 256 Icelake
compg039 a100-pcie-80gb flash 4 80 12 128 Icelake
compg040 a100-pcie-80gb flash 4 80 12 128 Icelake
compg041 a100-pcie-80gb flash - 80 12 128 Icelake
compg042 a100-pcie-80gb flash 4 80 12 128 Icelake
compg047 l4 flash 6 24 10 80 Emerald Rapids

(A '-' indicates no value was listed for that node.)

 

Legacy dedicated hardware

We maintain a number of GPU nodes which are dedicated to specific projects and experimental instrument workflows. Please email us with any questions regarding these dedicated nodes.

 

There are a small number of partitions dedicated to specific projects or instrument workflows:

gpu_strubi
gpu_cryosparc

 

NODE GPU_TYPE NUM_GPU_CARDS GPU_RAM(GB)_PER_CARD CPU_CORES TOTAL_RAM(GB) CPU_COMPATIBILITY
compg017 v100-pcie-32gb 2 32 24 1500 Skylake
compg018 quadro-rtx6000 4 24 32 384 Skylake
compg022 v100-pcie-16gb 4 16 32 384 Skylake
compg024 quadro-rtx6000 4 24 32 384 Skylake
compg025 quadro-rtx8000 4 48 32 384 Skylake
compg043 l40s 4 48 12 128 Sapphire Rapids
compg044 l40s 4 48 12 128 Sapphire Rapids
compg045 l40s 4 48 12 128 Sapphire Rapids
compg046 l40s 4 48 12 128 Sapphire Rapids

 

 

Submitting jobs

Jobs are submitted using sbatch in a similar way to non-GPU jobs; however, you must supply some extra parameters to indicate your GPU requirements, as follows:

sbatch --account gpu_<X>.prj  --partition gpu_p100_16gb  --gres gpu:<N> <JOBSCRIPT>

gpu_<X>.prj is the name of the research group/project Slurm GPU account. 

<N> is the number of GPUs required for each job. 

 

The default number of CPU cores per GPU depends on the partition (see 'Selecting the GPU type'). You can request more (or fewer) CPU cores for your job with --cpus-per-gpu <N>. Alternatively, you can set the total number of cores required for the job with -c <N>, where <N> is the number of cores.

The default system memory available per GPU depends on the partition (see 'Selecting the GPU type'). You can request more (or less) system memory for your job with --mem-per-gpu <M>G. Alternatively, you can specify the total memory requirement for your job with --mem <M>G, where <M> is the number of GB of memory required.
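Putting those options together, e.g. a 2-GPU job requesting 8 cores and 32 GB of system memory per GPU (the values, account and script names are illustrative):

```shell
# Override the partition defaults for cores and memory per GPU.
# gpu_<X>.prj and <JOBSCRIPT> are placeholders.
sbatch -A gpu_<X>.prj -p gpu_a100_80gb --gres gpu:2 \
       --cpus-per-gpu 8 --mem-per-gpu 32G <JOBSCRIPT>
```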

 

Examples:

Submit a job requiring a single A100 80GB GPU:

sbatch -A gpu_<X>.prj -p gpu_a100_80gb --gres gpu:1 <SCRIPT>

Submit a job requiring 2 GPUs (either RTX8000 or A100 40GB) that will finish in under 24 hours:

sbatch -A gpu_<X>.prj -p gpu_rtx8000_48gb,gpu_a100_40gb --gres gpu:2 --qos gpu_bmrc_24hr <SCRIPT>

Submit a job for an interactive session:

srun -A gpu_<X>.prj -p gpu_inter --gres gpu:1 --pty bash

 

Using fast local scratch space

A number of nodes have fast local NVMe drives for jobs that require a lot of I/O. This space can be accessed from:

/flash/scratch

or from project-specific folders in /flash on the nodes.

It is the user's responsibility to create a project folder in scratch for their job.

In Slurm you can select nodes with a scratch folder with:

sbatch -A gpu_<X>.prj -p gpu_p100_16gb --gres gpu:1 --constraint "flash" <JOBSCRIPT>

The scratch folder is open to all jobs, so care should be taken to protect your data by placing it in subfolders with the correct permissions.

As the space on these drives is limited you should remove any data from the scratch space when the job is complete. A scheduled automatic deletion from /flash/scratch will be introduced.
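A minimal sketch of the create/protect/clean-up pattern for scratch space. The helper below is illustrative, not a BMRC tool; the scratch root is a parameter so the snippet is self-contained (on the GPU nodes it would be /flash/scratch).

```shell
# Create a private folder under an open scratch area, restrict its
# permissions, and print its path. Illustrative helper, not a BMRC tool.
make_job_scratch() {
    # $1 = scratch root, $2 = unique name (e.g. ${USER}_${SLURM_JOB_ID})
    dir="$1/$2"
    mkdir -p "$dir"
    chmod 700 "$dir"   # keep other users out of the shared scratch area
    printf '%s\n' "$dir"
}

# In a job script it might be used as:
#   DIR=$(make_job_scratch /flash/scratch "${USER}_${SLURM_JOB_ID}")
#   ... run the workload against $DIR ...
#   rm -rf "$DIR"      # free the limited space when the job completes
```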

 

Monitoring

In an interactive session you should use the nvidia-smi command to check what processes are running on the GPUs and top to check what is running on the CPUs.

 

You can attach an interactive session to a running job, in order to monitor it with nvidia-smi, top or ps, using:

srun --jobid <JOB_ID> --pty bash

 

 

For the scheduled partitions, from a login node you can run e.g.

squeue -p gpu_rtx8000_48gb,gpu_a100_40gb

to see the jobs running and waiting in those GPU partitions.

 

You can see the occupancy of the GPUs for a partition with:

sinfo -N -O "Nodelist:16,Partition,Available:6,Timelimit,CPUsState,StateCompact:8,Gres:32,GresUsed:32" -p gpu_a100_80gb