The Slurm job scheduler is responsible for queuing jobs and dispatching them to the compute nodes. Each job is assigned a priority value: the higher this number, the further up the job sits in the queue and the sooner it will start. However, the priority calculation depends on many factors and can be difficult to follow or to compare between jobs. Therefore, we have collected some insights here to explain how it is done:

Formula

The formula used to calculate the priority consists of different factors and weights. On PALMA we are using the following:

Job_priority =

(PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor)

There can be additional factors; have a look at https://slurm.schedmd.com/priority_multifactor.html for details.

Factors

All factors are determined by Slurm itself and result in a number between 0.0 and 1.0. 

  • age_factor = 0.0 - 1.0, depending on how long the job has been waiting in the queue. The maximum value of 1.0 is reached after 14 days.
  • fair-share_factor = 1.0 - 0.0, depending on how many resources a user has already consumed in the past. The amount of consumed resources contributing to the fair-share factor is halved every 7 days, so your fair-share value will recover when you do not use many resources for a while. You can check your current value with the sshare command, see the example after this list.
  • job_size_factor = 0.0 - 1.0, depending on the amount of resources requested for the job. A job asking for the complete cluster would get a factor of 1.0, so larger jobs are slightly favored (see the section Favoring larger jobs below).
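
If you want to check your current fair-share value, you can use the sshare command (shown here as a sketch; the exact output columns depend on the local account configuration). The FairShare column it prints corresponds to the fair-share factor used in the formula above:

sshare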

Weights

The weights can be configured by the administrators, making each factor more or less important. They can be shown with the sprio command:

sprio -w
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE
        Weights                               1      20000     200000      10000
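
As a purely illustrative calculation (the factor values are made up, only the weights correspond to the configuration shown above), a job with an age factor of 0.5, a fair-share factor of 0.8 and a job size factor of 0.01 would get the following priority:

# hypothetical factors: age = 0.5, fair-share = 0.8, job size = 0.01
echo "20000*0.5 + 200000*0.8 + 10000*0.01" | bc
# result: 170100.00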

Showing job priorities

In the example below you can see the output of the squeue and sprio commands for four example jobs waiting in the normal partition (note that we are only comparing jobs within a single partition):

$ squeue -P -p normal --sort=-p,i --state=PD | head -n 5
JOBID      | PARTITION  |    STATE | CPUS | MIN_MEMORY | NODELIST(REASON)   |    TIME_LEFT | PRIORITY
    1   |    normal  |  PENDING | 1344 |      2500M |      (Resources)   |   7-00:00:00 |    23162
    2   |    normal  |  PENDING |   36 |         8G |      (Resources)   |   7-00:00:00 |    21534
    3   |    normal  |  PENDING |   36 |         8G |       (Priority)   |   7-00:00:00 |    21534
    4   |    normal  |  PENDING |   36 |         8G |       (Priority)   |   7-00:00:00 |    21534
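
To look at the priorities of your own pending jobs only, you can restrict squeue to your user (a sketch; the format string simply selects the job ID, priority and reason columns and can be adapted as needed):

squeue -u $USER --state=PD -o "%.10i %.10Q %.20R"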


$ sprio -lp normal --sort=-y,i | head -n 5
JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE
    1 normal         23162          0      20000       2375        788
    2 normal         21534          0      19310       2205         19
    3 normal         21534          0      19310       2205         19
    4 normal         21534          0      19310       2205         19
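
The per-factor breakdown of a single job can also be queried directly by its job ID, which is useful if you want to understand why a specific job is still waiting:

sprio -l -j <jobid>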

Backfilling

In addition to the normal priority-based scheduling, Slurm also uses a so-called backfill algorithm: it tries to squeeze smaller jobs in between larger jobs, as long as this does not delay the start of the higher-priority jobs. This allows for much better overall cluster utilization, as many resources would otherwise just sit idle, waiting for the next job. Whether a job was scheduled via backfilling can be seen if you run

scontrol show job <jobid>
...
... Scheduler=Backfill
...
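
Since the backfill scheduler uses the time limits of the jobs to find suitable gaps, requesting a realistic run time instead of the maximum increases the chance of your job being backfilled. A minimal, purely illustrative submission (resource values and script name are placeholders):

sbatch --time=02:00:00 --ntasks=4 --mem=8G my_job.sh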

Favoring larger jobs

We have configured the Slurm scheduler so that it slightly favors larger jobs, i.e. the setting in slurm.conf is as follows:

  • PriorityFavorSmall=No

With this setting, the job size factor is calculated by dividing the requested number of CPUs by the total number of CPUs in the system:


NCPUs/TotalCPUs = JobSizeFactor

E.g. requesting all resources of the cluster would lead to

TotalCPUs / TotalCPUs = 1.0

while requesting only part of the cluster leads to

NCPUs / TotalCPUs < 1.0
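
As a purely hypothetical example: on a system with 10000 CPUs in total, a job requesting 500 CPUs would get a job size factor of 500 / 10000 = 0.05, which, multiplied with the job size weight of 10000 shown above, contributes 500 points to the overall priority.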





