The BMRC cluster includes a number of NVIDIA GPU-accelerated servers to support AI, image processing and other GPU-accelerated workloads. Our GPU-accelerated servers are all located within our G group and are named in the format compg....
Access to the scheduled GPU nodes via qsub is restricted by default. If you would like to submit jobs to these GPU nodes please email us to request access.
We have a dedicated mailing list for BMRC GPU users: firstname.lastname@example.org. If you wish to be added to the list, please email us.
Varieties of GPU node
In our regular (i.e. non-GPU) cluster, there are 4 separate groups of nodes (compc, compd, compe, compf) where the hardware varies between groups but is identical within each group. The situation is different within the compg GPU nodes. Because GPU hardware evolves rapidly, there is considerable variation in the capabilities of the GPU nodes: they offer different combinations of CPU and RAM as well as different numbers and types of GPU card. Furthermore, each machine is configured to host only as many slots as it has GPU cards, on the assumption that every job will need at least one GPU card. In consequence, the RAM per slot on the GPU queue varies widely, from a minimum of 64GB up to 750GB.
Because of the variation in CPU, RAM, GPU card type and number of GPUs available per node, you may need to plan your job submissions carefully. The sections below provide full information on the nodes available in order to assist with your planning.
Scheduled GPU Cluster Nodes
There are seven nodes which are open to all users who request access to run GPU-accelerated jobs through the Univa (UGE/SGE) scheduler.
There are two cluster queues for GPU resources: short.qg and long.qg.
Jobs run on short.qg have a maximum job duration of 4 hours.
Jobs run on long.qg have a maximum job duration of 60 hours.
long.qg is only available on a subset of nodes, so we recommend submitting jobs to short.qg when you can.
Jobs are submitted to short.qg (or long.qg) using qsub in a similar way to a non-GPU job; however, you must supply some extra parameters to indicate your GPU requirements, as follows:
qsub -q short.qg -l gpu=N,gputype=XYZ ...
The specification of GPU parameters is preceded by -l (lowercase L).
- The gpu=N parameter is required and specifies how many GPU cards your job requires. If you do not specify it, your job will fail to run, reporting error code 100 or the message 'Unable to run job: Invalid syntax for RSMAP complex "gpu"'. For example, use gpu=1 to request 1 GPU card, or gpu=2 to request 2 GPU cards. Note the maximum number of GPU cards available per node in the table below.
- The gputype=XYZ parameter is optional and specifies what type of GPU card your job requires. For information on the available numbers and types of GPU cards, please see the table below. The principal constraint is that your job must be able to run on a single node - so you cannot e.g. request more GPU cards of a certain type than the maximum that is available on a single node.
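As a sketch, complete submissions might look like the following. The script name my_gpu_job.sh and the gputype value are placeholders; substitute a GPU type from the table below.

```shell
# Request one GPU of any type on the short queue (4-hour limit)
qsub -q short.qg -l gpu=1 my_gpu_job.sh

# Request two GPUs of a specific type on the long queue (60-hour limit);
# "p100" here is illustrative only - use a type listed in the table below
qsub -q long.qg -l gpu=2,gputype=p100 my_gpu_job.sh
```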
When submitting jobs, the total RAM requirement for your job will be compute RAM + GPU RAM. This means that you will need to request a sufficient number of slots to cover this total memory requirement.
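For example (hypothetical numbers): a job needing 100GB of compute RAM plus one 11GB GPU card has a total requirement of 111GB; at the default 60.8G per slot, that means requesting two slots. The arithmetic can be sketched as:

```shell
# Slots needed = ceiling(total RAM / RAM per slot); the figures are illustrative
awk 'BEGIN {
  need = 100 + 11        # compute RAM + GPU RAM, in GB
  slot = 60.8            # default RAM per slot, in GB
  slots = int(need / slot)
  if (slots * slot < need) slots++   # round up to a whole number of slots
  print slots
}'
# prints 2
```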
The default memory per slot is 60.8G. To submit a job that requires more memory per slot use the qsub option:
where M is the amount of memory required per slot. Jobs will only run on nodes that have enough available memory per slot.
| Node | GPU Type | Num GPU Cards | GPU RAM per card (GB) | Num UGE Slots | CPU Cores per Slot | RAM per Slot (GB) |
|------|----------|---------------|-----------------------|---------------|--------------------|-------------------|
Interactive GPU Nodes
There are three nodes which are open to all users to log in and run GPU-accelerated applications. These nodes are intended to allow you to develop and test your GPU code before submitting to the GPU queue. Information about the interactive nodes appears in the table below.
To connect to the interactive GPU nodes, log in to cluster1-2 and then run e.g. ssh compg005 .
The interactive GPU nodes can be quite busy, so please check whether somebody else is already using the GPUs before starting your jobs. See the monitoring notes below.
| Node | GPU Type | Num GPU cards | GPU RAM per card (GB) | CPU Cores | Total RAM (GB) | CPU Compatibility |
|----------|-------------|---|----|---------------------|-----|-----------|
| compg005 | GTX 1080 Ti | 4 | 11 | 20 | 256 | Ivybridge |
| compg006 | GTX 1080 Ti | 4 | 11 | 40 (hyperthreading) | 256 | Ivybridge |
| compg007 | GTX 1080 Ti | 4 | 11 | 40 (hyperthreading) | 256 | |
| compg008 | GTX 1080 Ti | 4 | 11 | 16 | 256 | |
We maintain a number of nodes which are dedicated to specific projects. Please email us with any questions regarding these dedicated nodes.
Fast Local Scratch Space
A number of nodes have fast local NVMe drives for jobs that require a lot of I/O. This space can be accessed from:
or from project specific folders in /flash on the nodes.
This folder is open to all jobs, so care should be taken to protect your data by placing it in subfolders with the correct permissions.
As the space on these drives is limited you should remove any data from the scratch space when the job is complete. A scheduled automatic deletion from /flash/scratch will be introduced.
On the interactive nodes you should use the nvidia-smi command to check what processes are running on the GPUs and top to check what is running on the CPUs.
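For example, on an interactive node you might run:

```shell
# Show current GPU utilisation and the processes using each card
nvidia-smi

# One-shot snapshot of CPU load (batch mode, single iteration)
top -b -n 1 | head -n 20
```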
On the scheduled nodes, from a login node you should run
qstat -u "*" -q short.qg
to see the jobs running and waiting in the GPU queue.
The CUDA libraries are required to run applications on NVIDIA GPUs. More advanced GPUs require later versions of the CUDA libraries. The CUDA page on Wikipedia has useful information about versions. Software packages typically need to be compiled for a particular version of CUDA.
Our pre-installed CUDA-related software is now made available, in the same way as the majority of our pre-installed software, via software modules. Use module avail to see which software packages are available and module load <MODULE-NAME> to load your desired software modules.
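For example, a typical session might look like the following. The module name CUDA/11.4 is illustrative only; check the output of module avail for the versions actually installed on the cluster.

```shell
# List available modules, filtered to CUDA-related entries
# (module avail writes its listing to stderr, hence the redirect)
module avail 2>&1 | grep -i cuda

# Load a CUDA module (the version name is a placeholder)
module load CUDA/11.4

# Confirm the CUDA compiler version the module provides
nvcc --version
```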
In addition to the main CUDA libraries themselves, we also have: