GPU Resources
The BMRC cluster includes a number of NVIDIA GPU-accelerated servers to support AI/ML, image processing and other GPU-accelerated workflows.
If you have any questions or comments about using the GPU resources on BMRC, please contact us at bmrc-help@medsci.ox.ac.uk.
RECENT CHANGES
Interactive sessions are available through Slurm using the gpu_interactive partition.
compg039-compg042 have been brought into service.
VARIETIES OF GPU NODE
In our regular (i.e. non-GPU) cluster, there are groups of nodes (e.g. compa, compe, compf) where the hardware varies between groups but is identical within each group. The situation is different within the compg GPU nodes. Because of rapidly changing hardware capabilities, there is considerable variation in the hardware of the GPU nodes: they offer different combinations of CPU and RAM as well as different numbers and types of GPU card. Furthermore, each machine is configured to host only as many scheduler slots as it has GPU cards, on the assumption that every job will need at least one GPU card. As a consequence, the available RAM per slot on the GPU partitions varies widely, from a minimum of 60.8 GB up to 750 GB.
Because of the variation in CPU, RAM, GPU card type and number of GPUs available per node, you may need to plan your job submissions carefully. The sections below provide full information on the nodes available in order to assist with your planning.
SCHEDULED GPU CLUSTER NODES
There are two Slurm partitions for GPU resources, gpu_short and gpu_long.
Jobs run on gpu_short have a maximum job duration of 4 hours.
Jobs run on gpu_long have a maximum job duration of 60 hours.
gpu_long is only available on a subset of nodes, so it is recommended that you submit jobs to gpu_short when you can.
There is also a partition for interactive jobs:
gpu_interactive
There are additional Slurm partitions for specific resources/workflows; please contact us if you need access to these.
gpu_relion
gpu_long_palamara
gpu_long_zhang
gpu_cryosparc
Jobs are submitted to gpu_short (or gpu_long) using sbatch in a similar way to submitting a non-GPU job; however, you must supply some extra parameters to indicate your GPU requirements, as follows:
sbatch -p gpu_short --gres gpu:<N> <JOBSCRIPT>
<N> is the number of GPUs required for each job.
Alternatively, you can use:
sbatch -p gpu_short --gpus-per-node <N> <JOBSCRIPT>
The recommended way to request GPUs for jobs on the BMRC Slurm GPU queues is to use --gres or --gpus-per-node.
Slurm offers other options for requesting GPUs, including --gpus, --gpus-per-task and --gpus-per-socket. These are relevant for MPI workloads and can lead to blocking reservations, so please contact BMRC before using them.
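For reference, here is a minimal sketch of a GPU job script (the job name, module name and final command are illustrative placeholders, not BMRC-specific values):
#!/bin/bash
#SBATCH -p gpu_short                 # GPU partition
#SBATCH --gres=gpu:1                 # request one GPU
#SBATCH -J example-gpu-job           # illustrative job name
# Load a CUDA module; the exact name depends on what 'module avail' shows.
module load <CUDA-MODULE>
# Run the GPU application (placeholder command).
python train.py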
Optionally, you can restrict your job to one type of GPU, for example:
sbatch -p gpu_short --gres gpu:a100-pcie-40gb:1 <JOBSCRIPT>
You can use Slurm features/constraints to specify the class(es) of GPU that you wish your job to run on. The features are listed in the 'Slurm Features' column in the table below. For example, to run on P100 and A100 nodes only:
sbatch -p gpu_short --gpus-per-node 1 --constraint "p100|a100" <JOBSCRIPT>
The default number of CPU cores per GPU is 6. You can request more (or fewer) CPU cores for your job with --cpus-per-gpu <N>, or set the total number of cores required for the job with -c <N>, where <N> is the number of cores.
The default system memory available per GPU is 60.8 GB. You can request more (or less) system memory for your job with --mem-per-gpu <M>G, or specify the total memory requirement for your job with --mem <M>G, where <M> is the number of GB of memory required.
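For example, to request one GPU together with 8 CPU cores and 90 GB of system memory (values chosen purely for illustration):
sbatch -p gpu_short --gres gpu:1 --cpus-per-gpu 8 --mem-per-gpu 90G <JOBSCRIPT>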
Node | GPU Type | Slurm Features | Num GPU Cards | GPU RAM per Card (GB) | CPU Cores per GPU | RAM per GPU (GB) | CPU Compatibility
compg009 | p100-sxm2-16gb | p100, flash | 4 | 16 | 6 | 91.2 | Skylake
compg010 | p100-sxm2-16gb | p100, flash | 4 | 16 | 6 | 91.2 | Skylake
compg011 | p100-sxm2-16gb | p100, flash | 4 | 16 | 6 | 91.2 | Skylake
compg013 | p100-sxm2-16gb | p100 | 4 | 16 | 6 | 91.2 | Skylake
compg016 | v100-pcie-32gb | v100, flash | 2 | 32 | 6 | 750 | Skylake
compg026 | p100-pcie-16gb | p100, flash | 4 | 16 | 10 | 91.2 | Skylake
compg027 | v100-pcie-16gb | v100 | 4 | 16 | 12 | 60.8 | Skylake
compg028 | quadro-rtx8000 | rtx8000, flash | 4 | 48 | 8 | 187.2 | Cascadelake
compg029 | quadro-rtx8000 | rtx8000, flash | 4 | 48 | 8 | 187.2 | Cascadelake
compg030 | quadro-rtx8000 | rtx8000, flash | 4 | 48 | 8 | 187.2 | Cascadelake
compg031 | a100-pcie-40gb | a100, flash | 4 | 40 | 8 | 91.2 | Cascadelake
compg032 | a100-pcie-40gb | a100, flash | 4 | 40 | 8 | 91.2 | Cascadelake
compg033 | a100-pcie-40gb | a100, flash | 4 | 40 | 8 | 91.2 | Cascadelake
compg034 | a100-pcie-40gb | a100, flash | 4 | 40 | 8 | 91.2 | Cascadelake
compg035 | a100-pcie-80gb | a100, flash | 4 | 80 | 8 | 91.2 | Icelake
compg036 | a100-pcie-80gb | a100, flash | 4 | 80 | 8 | 91.2 | Icelake
compg037 | a100-pcie-80gb | a100, flash | 2 | 80 | 24 | 256 | Icelake
compg038 | a100-pcie-80gb | a100, flash | 2 | 80 | 24 | 256 | Icelake
compg039 | a100-pcie-80gb | a100, flash | 4 | 80 | 12 | 128 | Icelake
compg040 | a100-pcie-80gb | a100, flash | 4 | 80 | 12 | 128 | Icelake
compg041 | a100-pcie-80gb | a100, flash | 4 | 80 | 12 | 128 | Icelake
compg042 | a100-pcie-80gb | a100, flash | 4 | 80 | 12 | 128 | Icelake
INTERACTIVE GPU SESSIONS
You can get an interactive session using the gpu_interactive partition, which has a 12-hour runtime limit. Jobs submitted to this partition run on compg009 and compg010.
srun -p gpu_interactive --gres gpu:1 --pty bash
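As with batch jobs, you can add further resource options to srun; for example, to request one GPU together with 4 CPU cores and a 2-hour limit (illustrative values):
srun -p gpu_interactive --gres gpu:1 --cpus-per-gpu 4 --time 2:00:00 --pty bash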
LEGACY DEDICATED NODES
We maintain a number of GPU nodes which are dedicated to specific legacy projects. Please email us with any questions regarding these dedicated nodes.
Node | GPU Type | Num GPU Cards | GPU RAM per Card (GB) | CPU Cores | Total RAM (GB) | CPU Compatibility
compg017 | v100-pcie-32gb | 2 | 32 | 24 | 1500 | Skylake
compg018 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg019 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg020 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg021 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg022 | v100-pcie-16gb | 4 | 16 | 32 | 384 | Skylake
compg023 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg024 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg025 | quadro-rtx8000 | 4 | 48 | 32 | 384 | Skylake
FAST LOCAL SCRATCH SPACE
A number of nodes have fast local NVMe drives for jobs that require a lot of I/O. This space can be accessed from:
/flash/scratch
or from project-specific folders in /flash on the nodes.
In Slurm, you can select nodes with a fast local scratch folder using the flash feature:
sbatch -p gpu_short --gpus-per-node 1 --constraint "flash" <JOBSCRIPT>
This folder is open to all jobs, so take care to protect your data by placing it in subfolders with the correct permissions.
As the space on these drives is limited, you should remove any data from the scratch space when your job completes. Scheduled automatic deletion from /flash/scratch will be introduced.
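A typical pattern inside a job script is to stage data into a private, job-specific subfolder and remove it when the job finishes; a minimal sketch (everything below /flash/scratch is a hypothetical layout):
# Create a job-specific scratch folder readable only by you.
SCRATCH=/flash/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
chmod 700 "$SCRATCH"
# ... copy input data in and run the I/O-heavy work against $SCRATCH ...
# Clean up when done.
rm -rf "$SCRATCH"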
MONITORING
In an interactive session, use the nvidia-smi command to check which processes are running on the GPUs and top to check what is running on the CPUs.
You can attach an interactive session to a running job, allowing you to monitor it with nvidia-smi, top or ps:
srun --jobid <JOB_ID> --pty bash
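You can also run a single monitoring command inside the job's allocation without starting a shell, for example:
srun --jobid <JOB_ID> nvidia-smi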
To see the jobs running and waiting in the GPU queues, run the following from a login node:
squeue -p gpu_short,gpu_long
You can see the occupancy of the GPUs for a partition with:
sinfo -N -O "Nodelist:16,Partition,Available:6,Timelimit,CPUsState,StateCompact:8,Gres:32,GresUsed:32" -p gpu_long
GPU SOFTWARE
The CUDA libraries are required to run applications on NVIDIA GPUs. Newer GPUs require more recent versions of the CUDA libraries; the CUDA page on Wikipedia has useful information about versions. Software packages typically need to be compiled for a particular version of CUDA.
Our pre-installed CUDA-related software is made available, in the same way as the majority of our pre-installed software, via software modules. Use module avail to see which software packages are available and module load <MODULE-NAME> to load your desired software modules.
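For example (module names vary over time, so check module avail for the versions actually installed):
module avail cuda            # list available CUDA modules
module load <MODULE-NAME>    # load the version your software was built against
nvcc --version               # confirm the CUDA compiler now on your PATH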
In addition to the main CUDA libraries themselves, we also provide a number of other widely used GPU software packages.
You can also install your own software, for example via a Python virtualenv or conda.
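For example, a minimal sketch of creating a personal Python environment (the environment path and package name are illustrative; choose a build that matches the CUDA version you load):
python -m venv ~/example-gpu-env           # create a virtualenv in your home directory
source ~/example-gpu-env/bin/activate      # activate it
pip install <GPU-PACKAGE>                  # e.g. a CUDA-enabled build of your framework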
RELION GPU
Information about running Relion on the Slurm GPU partitions will follow.