GPU Resources
The BMRC cluster includes a number of NVIDIA GPU-accelerated servers to support AI/ML, image processing and other GPU-accelerated workflows.
If you have any questions or comments about using the GPU resources on BMRC, please contact us at bmrc-help@medsci.ox.ac.uk.
RECENT CHANGES
Interactive sessions are available through Slurm using the gpu_interactive partition.
compg039-compg042 have been brought into service.
VARIETIES OF GPU NODE
In our regular (i.e. non-GPU) cluster, there are groups of nodes (e.g. compa, compe, compf) where the hardware varies between groups but is identical within each group. The situation is different within the compg GPU nodes. Because of rapidly changing hardware capabilities, there is considerable variation in the hardware of the GPU nodes: they offer different combinations of CPU and RAM as well as different numbers and types of GPU card. Furthermore, each machine is configured to host only as many scheduler slots as it has GPU cards, on the assumption that every job will need at least one GPU card. As a consequence, the available RAM per slot on the GPU partitions varies widely, from a minimum of 60.8 GB up to 750 GB.
Because of the variation in CPU, RAM, GPU card type and number of GPUs available per node, you may need to plan your job submissions carefully. The sections below provide full information on the nodes available in order to assist with your planning.
SCHEDULED GPU CLUSTER NODES
There are two Slurm partitions for GPU resources, gpu_short and gpu_long.
Jobs run on gpu_short have a maximum job duration of 4 hours.
Jobs run on gpu_long have a maximum job duration of 60 hours.
gpu_long is only available on a subset of nodes, so it is recommended that you submit jobs to gpu_short when you can.
There is also a partition for interactive jobs:
gpu_interactive
There are additional Slurm partitions for specific resources/workflows; please contact us if you need access to these.
gpu_relion
gpu_long_palamara
gpu_long_zhang
gpu_cryosparc
Jobs are submitted to gpu_short (or gpu_long) using sbatch in a similar way to submitting a non-GPU job; however, you must supply some extra parameters to indicate your GPU requirements, as follows:
sbatch -p gpu_short --gres gpu:<N> <JOBSCRIPT>
<N> is the number of GPUs required for each job.
Alternatively, you can use:
sbatch -p gpu_short --gpus-per-node <N> <JOBSCRIPT>
The recommended way to request GPUs for jobs on the BMRC Slurm GPU queues is to use --gres or --gpus-per-node.
Slurm offers other options for requesting GPUs, including --gpus, --gpus-per-task and --gpus-per-socket. These are relevant for MPI workloads and can lead to blocking reservations, so please contact BMRC before using them.
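For reference, here is a minimal sketch of a GPU job script (the job name, module name and final command are illustrative placeholders, not BMRC-specific values):
#!/bin/bash
#SBATCH -p gpu_short                 # GPU partition
#SBATCH --gres=gpu:1                 # request one GPU
#SBATCH -J example-gpu-job           # illustrative job name
# Load a CUDA module; the exact name depends on what 'module avail' shows.
module load <CUDA-MODULE>
# Run the GPU application (placeholder command).
python train.py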
Optionally, you can restrict your job to one type of GPU, for example:
sbatch -p gpu_short --gres gpu:a100-pcie-40gb:1 <JOBSCRIPT>
You can use Slurm features/constraints to specify the class(es) of GPU that you wish your job to run on. The features are listed in the 'Slurm Features' column in the table below. For example, to run on P100 and A100 nodes only:
sbatch -p gpu_short --gpus-per-node 1 --constraint "p100|a100" <JOBSCRIPT>
The default number of CPU cores per GPU is 6. You can request more (or fewer) CPU cores for your job with --cpus-per-gpu <N>, or set the total number of cores required for the job with -c <N>, where <N> is the number of cores.
The default system memory available per GPU is 60.8 GB. You can request more (or less) system memory for your job with --mem-per-gpu <M>G, or specify the total memory requirement for your job with --mem <M>G, where <M> is the number of GB of memory required.
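For example, to request one GPU together with 8 CPU cores and 90 GB of system memory (values chosen purely for illustration):
sbatch -p gpu_short --gres gpu:1 --cpus-per-gpu 8 --mem-per-gpu 90G <JOBSCRIPT>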
Node | GPU Type | Slurm Features | Num GPU Cards | GPU RAM per Card (GB) | CPU Cores per GPU | RAM per GPU (GB) | CPU Compatibility
compg009 | p100-sxm2-16gb | p100, flash | 4 | 16 | 6 | 91.2 | Skylake
compg010 | p100-sxm2-16gb | p100, flash | 4 | 16 | 6 | 91.2 | Skylake
compg011 | p100-sxm2-16gb | p100, flash | 4 | 16 | 6 | 91.2 | Skylake
compg013 | p100-sxm2-16gb | p100 | 4 | 16 | 6 | 91.2 | Skylake
compg016 | v100-pcie-32gb | v100, flash | 2 | 32 | 6 | 750 | Skylake
compg026 | p100-pcie-16gb | p100, flash | 4 | 16 | 10 | 91.2 | Skylake
compg027 | v100-pcie-16gb | v100 | 4 | 16 | 12 | 60.8 | Skylake
compg028 | quadro-rtx8000 | rtx8000, flash | 4 | 48 | 8 | 187.2 | Cascadelake
compg029 | quadro-rtx8000 | rtx8000, flash | 4 | 48 | 8 | 187.2 | Cascadelake
compg030 | quadro-rtx8000 | rtx8000, flash | 4 | 48 | 8 | 187.2 | Cascadelake
compg031 | a100-pcie-40gb | a100, flash | 4 | 40 | 8 | 91.2 | Cascadelake
compg032 | a100-pcie-40gb | a100, flash | 4 | 40 | 8 | 91.2 | Cascadelake
compg033 | a100-pcie-40gb | a100, flash | 4 | 40 | 8 | 91.2 | Cascadelake
compg034 | a100-pcie-40gb | a100, flash | 4 | 40 | 8 | 91.2 | Cascadelake
compg035 | a100-pcie-80gb | a100, flash | 4 | 80 | 8 | 91.2 | Icelake
compg036 | a100-pcie-80gb | a100, flash | 4 | 80 | 8 | 91.2 | Icelake
compg037 | a100-pcie-80gb | a100, flash | 2 | 80 | 24 | 256 | Icelake
compg038 | a100-pcie-80gb | a100, flash | 2 | 80 | 24 | 256 | Icelake
compg039 | a100-pcie-80gb | a100, flash | 4 | 80 | 12 | 128 | Icelake
compg040 | a100-pcie-80gb | a100, flash | 4 | 80 | 12 | 128 | Icelake
compg041 | a100-pcie-80gb | a100, flash | 4 | 80 | 12 | 128 | Icelake
compg042 | a100-pcie-80gb | a100, flash | 4 | 80 | 12 | 128 | Icelake
INTERACTIVE GPU SESSIONS
You can get an interactive session using the gpu_interactive partition, which has a 12-hour runtime limit. Jobs submitted to this partition run on compg009 and compg010.
srun -p gpu_interactive --gres gpu:1 --pty bash
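As with batch jobs, you can add further resource options to srun; for example, to request one GPU together with 4 CPU cores and a 2-hour limit (illustrative values):
srun -p gpu_interactive --gres gpu:1 --cpus-per-gpu 4 --time 2:00:00 --pty bash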
LEGACY DEDICATED NODES
We maintain a number of GPU nodes which are dedicated to specific legacy projects. Please email us with any questions regarding these dedicated nodes.
Node | GPU Type | Num GPU Cards | GPU RAM per Card (GB) | CPU Cores | Total RAM (GB) | CPU Compatibility
compg017 | v100-pcie-32gb | 2 | 32 | 24 | 1500 | Skylake
compg018 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg019 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg020 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg021 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg022 | v100-pcie-16gb | 4 | 16 | 32 | 384 | Skylake
compg023 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg024 | quadro-rtx6000 | 4 | 24 | 32 | 384 | Skylake
compg025 | quadro-rtx8000 | 4 | 48 | 32 | 384 | Skylake
FAST LOCAL SCRATCH SPACE
A number of nodes have fast local NVMe drives for jobs that require a lot of I/O. This space can be accessed from:
/flash/scratch
or from project-specific folders in /flash on the nodes.
In Slurm, you can select nodes with a fast local scratch folder using the flash feature:
sbatch -p gpu_short --gpus-per-node 1 --constraint "flash" <JOBSCRIPT>
This folder is open to all jobs, so take care to protect your data by placing it in subfolders with the correct permissions.
As the space on these drives is limited, you should remove any data from the scratch space when your job completes. Scheduled automatic deletion from /flash/scratch will be introduced.
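A typical pattern inside a job script is to stage data into a private, job-specific subfolder and remove it when the job finishes; a minimal sketch (everything below /flash/scratch is a hypothetical layout):
# Create a job-specific scratch folder readable only by you.
SCRATCH=/flash/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
chmod 700 "$SCRATCH"
# ... copy input data in and run the I/O-heavy work against $SCRATCH ...
# Clean up when done.
rm -rf "$SCRATCH"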
MONITORING
In an interactive session, use the nvidia-smi command to check which processes are running on the GPUs and top to check what is running on the CPUs.
You can attach an interactive session to a running job, allowing you to monitor it with nvidia-smi, top or ps:
srun --jobid <JOB_ID> --pty bash
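You can also run a single monitoring command inside the job's allocation without starting a shell, for example:
srun --jobid <JOB_ID> nvidia-smi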
To see the jobs running and waiting in the GPU queues, run the following from a login node:
squeue -p gpu_short,gpu_long
You can see the occupancy of the GPUs for a partition with:
sinfo -N -O "Nodelist:16,Partition,Available:6,Timelimit,CPUsState,StateCompact:8,Gres:32,GresUsed:32" -p gpu_long
GPU SOFTWARE
The CUDA libraries are required to run applications on NVIDIA GPUs. Newer GPUs require more recent versions of the CUDA libraries; the CUDA page on Wikipedia has useful information about versions. Software packages typically need to be compiled for a particular version of CUDA.
Our pre-installed CUDA-related software is made available, in the same way as the majority of our pre-installed software, via software modules. Use module avail to see which software packages are available and module load <MODULE-NAME> to load your desired software modules.
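For example (module names vary over time, so check module avail for the versions actually installed):
module avail cuda            # list available CUDA modules
module load <MODULE-NAME>    # load the version your software was built against
nvcc --version               # confirm the CUDA compiler now on your PATH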
In addition to the main CUDA libraries themselves, we also provide a number of other widely used GPU software packages.
You can also install your own software, for example via a Python virtualenv or conda.
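For example, a minimal sketch of creating a personal Python environment (the environment path and package name are illustrative; choose a build that matches the CUDA version you load):
python -m venv ~/example-gpu-env           # create a virtualenv in your home directory
source ~/example-gpu-env/bin/activate      # activate it
pip install <GPU-PACKAGE>                  # e.g. a CUDA-enabled build of your framework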
RELION GPU
Information about running Relion on the Slurm GPU partitions will follow.