A description of Cluster Shares and how they affect job prioritization for the Slurm scheduler used on the BMRC Cluster.

Overview for PIs

The most commonly used BMRC service is the cluster. The cluster is the cheapest way of carrying out computation (on less sensitive data), as jobs from all users are packed tightly onto the compute nodes so that as many cores as possible are in use. The cluster is very busy, with jobs starting and finishing all the time: typically it runs over 100 million jobs per year, meaning that new jobs start and old jobs finish about three times every second.

The model is that all users submit jobs and then wait while a piece of software, the scheduler (in this case Slurm), automatically decides the order in which those jobs are executed. The decision is based on the resources requested by each job (for example, how many cores, how much memory, and what type of GPU) and on how many resources are currently allocated to the user's project/group relative to those currently allocated to other projects/groups. Incidentally, a request for an interactive job is handled in the same way as any other job, except that if the request cannot be satisfied immediately the user either has to wait or the interactive job fails immediately.
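
As an illustration, both batch and interactive jobs are requested with standard Slurm commands and then handled by the scheduler as described above. The partition name, resource values and script name below are placeholders rather than BMRC defaults:

    # Submit a batch job, stating the resources it needs
    sbatch --partition=short --cpus-per-task=4 --mem=16G my_analysis.sh

    # Request an interactive session; if the resources are not available
    # immediately you may have to wait here, or the request may fail,
    # depending on how the request is made and configured
    srun --partition=short --cpus-per-task=1 --mem=8G --pty bash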

Once a job has started it has all the resources it requested and it runs flat out, so that it consumes those resources for as short a time as possible. The only decision the scheduler can make is the order in which the different jobs start. While the scheduling algorithm is constantly being reviewed and refined, the overall goal is very simple. All jobs submitted by users belong to a project/group, and the number of shares allocated to the project/group is the key determinant of scheduling order. The target is for the ratio of resources currently allocated to each project/group to match the ratio of the number of shares purchased by each project/group: for example, a project/group that has bought twice as many shares as another should, while both have waiting jobs, have roughly twice as many resources allocated. The job that starts next is the one whose start brings the currently-allocated ratios closest to these target ratios.

Note that shares are sold to the project/group and not to individual users. This means that all users in a project/group are treated as a single set of jobs for scheduling purposes. This largely protects projects/groups from adversely impacting each other. The way in which usage within a group is balanced is covered in the next section.

This approach to scheduling has proven itself over a period of 15 years to be very "research friendly": it encourages use and naturally allows bursts of work for a user while there is spare capacity, but it also pretty much guarantees a minimum level of service at the project/group level. It works very well for typical projects/groups across the Medical Sciences Division, where there are periods of intensive computing followed by periods of low activity (for example, data collection or lab work). In fallow times, resources are not reserved and idle, but shared out fairly among all the projects/groups that currently have jobs waiting to execute.

Very nominally, each share entitles a group to one CPU core year of compute time (8,760 CPU core hours), but unless actual use diverges from that consistently over a period of time, BMRC tries to maintain a constant and predictable charge to projects/groups, which helps with grant management. The very nature of research means that accurate usage forecasting is impossible, and we find that focusing on selling shares acts as an effective "insurance policy" for the unexpected humps and bumps of active research.

The goal of BMRC is to help research groups be as successful as possible, since enabling high-impact research is the best way we can help future funding applications succeed, which in turn will lead to further investment in BMRC. If hard, externally imposed deadlines for particularly high-impact research give compelling reasons to bend the scheduler rules, the first step should be to email BMRC with as much notice as possible. Otherwise, BMRC currently limits each user to a maximum usage of 600 cores as a way of ensuring that there is usually enough resource to meet all requests. Obviously, all sets of rules can be "gamed", and BMRC will act to change the rules if such gaming causes problems for other users.

Overview for Users

The description above is what is known technically as a "share tree": the trunk is the total number of shares that have been sold and it divides into project/group branches where the weight of each branch is the number of shares that have been sold to each project/group. Note that this tree description does not mention users at all, and that implies that all scheduling decisions are made at the project/group level. That is true as far as the PI (and the cost to the PI) is concerned, but it would be unworkable for projects/groups with more than one user. The solution is to have more than one layer to the share tree where branches subdivide into sub-branches. This is where the users come in and it is explained in this section. Note that these second-level scheduling decisions are considered only after the top-level decisions have been made. The number of active users with submitted jobs in one project/group at any time should have no impact on other users in other projects/groups – these second-level decisions only affect how compute resources are shared within a project/group.

All projects/groups are actually set up by default with three Slurm projects (accounts): each branch divides into three sub-branches called group.prj.high, group.prj and group.prj.low. The number of shares assigned to each of these entities is 100, 10 and 1, respectively. Most of the time (and by default) all users submit jobs linked to group.prj; however, they are free to change this association (see the example below). At the level of the project/group, and relative to the resources allocated to group.prj, the scheduler aims to allocate 10x more resources to group.prj.high and 10x fewer resources to group.prj.low. Note, this is a second-level effect: moving all jobs to group.prj.high does not affect how many jobs one project/group gets relative to another project/group; it merely controls the scheduling order within a project/group. This functionality is best used sparingly and with the agreement of the other users in your project/group; BMRC does not expect to have to mediate intra-group scheduling clashes. An example use case for group.prj.high would be when one user needs to get their work through before another user in the same group (for example, last-minute calculations for paper revisions). An example use case for group.prj.low would be when one user wants to stack up lots of jobs for execution but is in no hurry for the results (for example, when they are going on holiday).
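
For example, the project/account a job is charged to is chosen at submission time with sbatch's --account (-A) option. A minimal sketch, in which "group" stands for your own project/group name and the script names are placeholders:

    # Default: submit against the group's normal-priority account
    sbatch --account=group.prj my_job.sh

    # Jump the internal queue for urgent work (with your group's agreement)
    sbatch --account=group.prj.high urgent_job.sh

    # Park low-urgency work so it does not hold up colleagues
    sbatch --account=group.prj.low backlog_job.sh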

There is a third level to the share tree: sub-branches dividing into sub-sub-branches (twigs). As before, this does not affect how many jobs one project/group gets relative to another project/group: it merely controls the scheduling order within a project/group. The behaviour is different for the different sub-branches. For group.prj.high, it is assumed that these are high-priority jobs and they are therefore executed in the order in which they were submitted, regardless of the user that submitted them. Beware, this means that one user can effectively dominate the usage for their project/group. For group.prj and group.prj.low, it is assumed that all users are equally important and the target is to balance the resources currently allocated to each user within the project/group. This weighting is applied automatically and means, for example, that if one user starts work and submits lots of jobs slightly before a second user in the same project/group, then the scheduler will preferentially start jobs for the second user until both users are consuming the same resources; thereafter it will try to maintain that balance as jobs start and finish or as the load from other users across the cluster changes over time. One user having lots of waiting jobs has no effect on the scheduling priority of those jobs relative to jobs belonging to other users with only a handful of submitted jobs each: what matters is the relative amount of resources currently allocated to each user.
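
How this plays out day to day can be watched with squeue; a sketch, in which "group" again stands for your own project/group name:

    # List running and pending jobs for a group, showing who owns them,
    # which sub-account (.high / default / .low) they were submitted to,
    # their state, elapsed time and requested CPU count
    squeue -A group.prj,group.prj.high,group.prj.low \
           -o "%.10i %.18a %.10u %.8T %.10M %.6C"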

Technical details

In Slurm, it is possible to tune the order in which jobs get scheduled (job priority). BMRC uses the Slurm multifactor priority plugin combined with the Fair Tree fairshare algorithm to determine job priority. See https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf for further details.
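
The configured priority plugin and its settings can usually be inspected from a login node; a minimal check (the exact output depends on the cluster configuration):

    # Show the priority plugin, decay half-life and factor weights
    scontrol show config | grep -i '^Priority'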

BMRC uses a multi-level share tree to determine job priority between groups and also within groups. For example, for a group called bloggs with three users, the structure might look like this:

Account            User     RawShares  NormShares  RawUsage  EffectvUsage  FairShare
bloggs                              5    0.001696         0      0.000000
 bloggs.prj.high                  100    0.900901         0      0.000000
  bloggs.prj.high  abc123      parent    0.900901         0      0.000000   0.765597
  bloggs.prj.high  ijk456      parent    0.900901         0      0.000000   0.765597
  bloggs.prj.high  xyz789      parent    0.900901         0      0.000000   0.765597
 bloggs.prj.low                     1    0.009009         0      0.000000
  bloggs.prj.low   abc123          10    0.333333         0      0.000000   0.765597
  bloggs.prj.low   ijk456          10    0.333333         0      0.000000   0.765597
  bloggs.prj.low   xyz789          10    0.333333         0      0.000000   0.765597
 bloggs.prj                        10    0.090090         0      1.000000
  bloggs.prj       abc123          10    0.333333         0      0.000000   0.764892
  bloggs.prj       ijk456          10    0.333333         0      0.000000   0.764892
  bloggs.prj       xyz789          10    0.333333         0      1.000000   0.764658
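
A table of this shape can be produced with Slurm's sshare command; a sketch using the example account names above (the field list may need adjusting for your site):

    # Show shares, usage and fair-share factors for the example group
    sshare -a -A bloggs.prj,bloggs.prj.high,bloggs.prj.low \
           -o Account,User,RawShares,NormShares,RawUsage,EffectvUsage,FairShare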

Each group has a raw Fairshare, which defines its share of resources relative to other groups. In the example, the bloggs group has a top-level account, bloggs, with a raw Fairshare of 5.

For assigning job priority within each group there is a normal-priority account with a raw Fairshare of 10, a high-priority account with a raw Fairshare of 100, and a low-priority account with a raw Fairshare of 1.

All users within the normal-priority account have an equal raw Fairshare of 10, and similarly for the low-priority account. Within the high-priority account, each user has the Fairshare of the parent account, which effectively means that within this account jobs are handled on a first-come first-served basis.

The share value used in the calculation of FairShare is based on the normalized share at that level. The normalized share for an account or user is its raw share at that level divided by the sum of all the shares at that level. Users under the high-priority account have a normalized share equal to that of the parent account.

Each group should have a share of the cluster determined by its normalized share at the top level of the tree. Within each group, the high-, normal- and low-priority accounts have unequal normalized shares at that level of the tree; for example, the high-priority account has a normalized share of 0.900901 = 100 / (100 + 10 + 1). Finally, the three users under bloggs.prj each have a normalized share of 0.333333 at that level of the tree, since there are three of them and each has an equal raw share; each user should therefore receive an equal portion of the group's share of the cluster under the normal- and low-priority accounts. Conversely, under the high-priority account users within a group are treated on a first-come first-served basis.

In the example above, the only current usage is by user xyz789 in bloggs.prj (as shown by EffectvUsage). The Fairshare for that user under this association (user + account) is therefore slightly lower than that of the other users under the same account.

For each waiting job, the current priority is calculated by the multifactor priority plugin, which combines several weighted factors. The weights for all of these factors are set to zero except for PriorityWeightAge (which is set to 10,000,000) and PriorityWeightFairshare (which is set to 100,000,000), so BMRC makes the fairshare weighting 10x more important than the age weighting. The age factor grows the longer a job waits in the queue, while the fairshare factor is based on the scheduler's decaying memory of recent usage, so if you have used a lot of compute recently then your priority will be somewhat down-weighted. The half-life of that usage decay is set to be quite short (1 hour), which approximates a "use-it-or-lose-it" approach to scheduling.
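
The configured weights and the per-factor priority of waiting jobs can be inspected with sprio; a brief sketch:

    # Show the configured weight of each priority factor
    sprio -w

    # Show the priority breakdown for your own pending jobs
    sprio -l -u $USER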

Finally, each user is limited to 600 concurrently allocated CPU cores, a limit defined by the Quality of Service (QoS). Note that this might be set to a lower number in the future – perhaps to a multiple of the number of shares that a group has paid for – and memory and GPU resources may also be factored into the calculation.
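
The per-user limit is attached to a QoS and can be checked with sacctmgr; a sketch (the QoS name on BMRC, and the exact limit string such as cpu=600, may differ):

    # List QoS definitions and their per-user resource (TRES) limits
    sacctmgr show qos format=Name,MaxTRESPerUser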