2. Jobs howto

Charging jobs to an account is new for the UCI community. Like any policy, it can be two-edged.

  • A large fraction of users should be able to run Allocated Jobs and never see the limits of their accounts.

  • Users who are running a very large number of Free Jobs are likely to have some of their free jobs preempted (killed).

In this section, we provide information about how to submit your jobs to Slurm, how to monitor them and how to request various resources.

Additional specific Job examples show in depth how to run array jobs, request GPUs, CPUs, and memory for a variety of different job types and common applications.

2.1. General Recommendations

Get the most from your allocation:

  • Look at your past jobs and see how much CPU and memory they used. Don’t request more than needed.

  • Prioritize your own work. Test and low-priority jobs can be free jobs. Others should be allocated.

  • Understand that free comes with no guarantees. Your free job can be killed at any time.

Quota Enforcement

When users exceed their disk space quota or CPU hour allocation, the following will happen:

  • users will not be able to submit new jobs

  • running jobs will fail

Important

Please check the available disk space and CPU hours in your Slurm account regularly. Delete or archive data as needed.

Inherited environment

Any method of job submission to Slurm (batch or interactive) inherits all environment variables that were set in your login shell (on the login node where you submit the job).

By default there are none, unless you change the environment by setting environment variables on the command line or in your $HOME/.bashrc file, or by using conda.

  • Check your $HOME/.bashrc and make sure that there are no environment variables set that can cause problems in the batch script or in srun command.

  • Do not load modules in your $HOME/.bashrc because this changes the environment of your login shell. Load modules either in the Slurm submit script or on an interactive node after you have executed the srun command, not in the login shell before job submission.

  • If you are using conda, please review the Install with conda/mamba guide, which explains how to cleanly separate the conda-set environment from your login environment.

Job submission, resource requests, and related topics are described in detail in the sections below.

2.2. Batch Job

sbatch submit-script.sub

A batch job is run at some time in the future by the scheduler. Batch jobs are submitted to Slurm with the sbatch command, and the job description is provided by the submit script. An example job submit script:

File simple.sub

#!/bin/bash

#SBATCH --job-name=test       ## Name of the job.
#SBATCH -A panteater_lab      ## CHANGE account to charge 
#SBATCH -p free               ## partition name
#SBATCH --nodes=1             ## (-N) number of nodes to use
#SBATCH --ntasks=1            ## (-n) number of tasks to launch
#SBATCH --cpus-per-task=1     ## number of cores the job needs
#SBATCH --error=slurm-%J.err  ## error log file
#SBATCH --output=slurm-%J.out ## output log file

# Run command hostname and save output to the file out.txt
hostname > out.txt

To submit a job on HPC3, log in and, using your favorite editor, create the file simple.sub with the contents shown above.

Edit the Slurm account to charge for the job to either your personal account or lab account. Your personal account is the same as your UCINetID.

To submit the job:
[user@login-x:~]$ sbatch simple.sub
Submitted batch job 21877983

When the job has been submitted, Slurm returns a job ID (here 21877983) that will be used to reference the job in Slurm user log files and Slurm job reports. After the job is finished, there will be 3 files created by the job:

slurm-21877983.err:

Slurm job error log file

slurm-21877983.out:

Slurm job output log file

out.txt:

Output file created by a specific command that was run in the job.

Note

Slurm error and output log files are extremely useful, especially for tracking the progress of jobs and diagnosing issues.
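When a script needs to locate the log files of a job it has just submitted, the job ID can be taken from the confirmation line that sbatch prints. The following is a minimal sketch, not part of the HPC3 tooling: the helper names job_id_from_sbatch and log_files_for_job are hypothetical, and the file names assume the --error and --output patterns used in simple.sub above.

```shell
# Extract the numeric job ID from the confirmation line printed by sbatch,
# e.g. "Submitted batch job 21877983". (Hypothetical helper.)
job_id_from_sbatch() (
    line="$1"
    id="${line##* }"            # keep only the last space-separated field
    case "$id" in
        ''|*[!0-9]*) exit 1 ;;  # not a number: report failure
        *) echo "$id" ;;
    esac
)

# Names of the log files produced by the --error/--output patterns
# used in simple.sub. (Hypothetical helper.)
log_files_for_job() {
    echo "slurm-$1.err slurm-$1.out"
}

# Example with the sample output from this section (no scheduler call is made).
id=$(job_id_from_sbatch "Submitted batch job 21877983")
log_files_for_job "$id"
```

In a real script the line would come from sbatch itself. The sbatch option --parsable prints only the job ID, which avoids the parsing step.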

2.3. Interactive job

srun

The command srun is used to submit an interactive job, which runs in a shell terminal. The job uses your console for standard input/output/error.

All interactive jobs must run on a single node; they cannot run on multiple nodes.

Use this method:

when you want to test a short computation, compile software, or run an interactive Python or R session.

Do not use this method:

when your job runs for many hours or days. Use sbatch instead.

Important

srun submits jobs for execution but it does not bypass scheduler priority.
If your job cannot run immediately, you will wait until Slurm can schedule your request.

The main differences between srun and sbatch:

srun                                               sbatch
-------------------------------------------------  -----------------------------------
Interactive and blocking                           Batch processing and non-blocking
You type commands interactively                    Your commands run unattended
Can be used to create job steps in submit scripts  Can do everything srun can and more

  1. Get an interactive node:

    [user@login-x:~]$ srun -c 2 -p free --pty /bin/bash -i
    srun: job 32654143 queued and waiting for resources
    srun: job 32654143 has been allocated resources
    [user@hpc3-y-z:~]$
    

    Once the srun command is executed, the scheduler allocates available resources and starts an interactive shell on the allocated node. Your shell prompt will show the new hostname.

  2. Execute your interactive commands

    [user@hpc3-y-z:~]$ module load python/3.10.2
    [user@hpc3-y-z:~]$ myProgRun.py -arg1 someDir/ -d outputDir/ -f file.nii -scale > outfile
    
  3. Once done with your work simply type at the prompt:

    [user@hpc3-y-z:~]$ exit

Examples of interactive job requests:

Note that the options --pty /bin/bash -i must come last on the command line and must not be separated by other options.

[user@login-x:~]$ srun -A PI_LAB --pty /bin/bash -i                     # 1
[user@login-x:~]$ srun -p free --pty /bin/bash -i                       # 2
[user@login-x:~]$ srun --mem=8G -p free --pty /bin/bash -i              # 3
[user@login-x:~]$ srun -c 4 --time=10:00:00 -N 1 --pty /bin/bash -i     # 4
[user@login-x:~]$ srun -p free-gpu --gres=gpu:V100:1 --pty /bin/bash -i # 5
[user@login-x:~]$ srun -p free --x11 --pty /bin/bash -i                 # 6

  1. use standard partition and charge to the PI_LAB account

  2. use free partition (where it may be killed at any time)

  3. use free partition and ask for 8Gb of memory per job (ONLY when you truly need it)

  4. use standard partition and ask for 4 CPUs for 10 hrs

  5. use free-gpu partition and ask for one GPU. DO NOT ask for more than 1 GPU!

  6. start an interactive session with X forwarding enabled (option --x11) for GUI jobs.
    Note, a user must have logged in to HPC3 with ssh X forwarding enabled (see Ssh and Xforward) before running this srun command.

2.4. Attach to a job

srun --pty --jobid

Attention

The ssh access to compute nodes is turned off

Users need to use Slurm commands to find a job ID and to attach to a running job if they want to run simple verification commands on the node where the job is running.

Once attached to a job, the user is placed on the node where the job is running (the first one listed for a multi-node job) and runs inside the cgroup (CPU, RAM, etc.) of the running job. This means the user:

  • will be able to execute simple commands such as ls, top, ps, etc.

  • will not be able to start new processes that use resources beyond those allocated to the job. Any command uses computing resources and adds to the usage of the job.

  • needs to type exit after executing the desired verification commands in order to detach from the job. The original job will keep running.

Find jobid and attach to it:
[user@login-x:~]$ squeue -u panteater
  JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
3559123      free    Tst41 panteater  R   17:12:33      5 hpc3-14-02
3559124      free    Tst42 panteater  R   17:13:33      7 hpc3-14-17,hpc3-15-[05-08]

[user@login-x:~]$ srun --pty --jobid 3559123 --overlap /bin/bash
[user@hpc3-14-02:~]$

Execute your commands at the prompt and exit:

[user@hpc3-14-02:~]$ ls /tmp/panteater/
[user@hpc3-14-02:~]$ exit
[user@login-x:~]$

Attach to a specific node using -w switch (for multi-node jobs):
[user@login-x:~]$ srun --pty --jobid 3559124 --overlap -w hpc3-15-08 /bin/bash
[user@hpc3-15-08:~]$

Most often users just need to see the processes of the job, etc. Such commands can be run directly.

Run top command while attaching to the running job:
[user@login-x:~]$ srun --pty --overlap --jobid $JOBID top

2.5. Requesting Resources

2.5.1. Nodes

Single node jobs:

Most jobs on HPC3 including all interactive jobs are single node jobs and must be run on a single node.

Users should explicitly ask for 1 node. This is important to let SLURM know that all your processes should be on a single node and not spread over multiple nodes.

If a single node job is submitted to multiple nodes it will either:

  • fail

  • misuse the resources. You will be charged for reserved and unused resources.

In your submit script use:

#SBATCH --nodes=1                ## (-N) use 1 node

Multiple node jobs:

Only a few applications, those compiled to run with OpenMPI or MPICH, need to use multiple nodes. For such applications, your submit script needs to include the number of nodes:

#SBATCH --nodes=2                ## (-N) use 2 nodes

2.5.2. Features/Constraints

HPC3 has heterogeneous hardware with several different CPU types. You can request that a job only runs on nodes with certain features.

Features are requested via constraints. To request a feature/constraint, add the following to your submit script:

#SBATCH --constraint=<feature_name>

where <feature_name> is one of the defined features (or one of the standard features described in the Slurm sbatch guide).

We defined the following features for node selection:

Table 2.3 HPC3 Defined Features

Feature name          Node description                       Node count     Cores
                      (processor/storage)                                   min/mod/max
--------------------  -------------------------------------  -------------  ------------
intel                 any Intel node including HPC legacy    compute: 171   24 / 40 / 48
                                                             GPU: 32        32 / 40 / 40
avx512                Intel AVX512                           compute: 166   28 / 40 / 48
                                                             GPU: 32        32 / 40 / 40
epyc or amd           any AMD EPYC                           19             40 / 64 / 64
epyc7551              AMD EPYC 7551                          3              40 / 64 / 64
epyc7601              AMD EPYC 7601                          16             64 / 64 / 64
nvme or fastscratch   Intel AVX512 with /tmp on NVMe disk    66             32 / 48 / 48
mlx5_ib               Updated Infiniband firmware            131            36 / 40 / 64
mlx4_ib               Older Infiniband firmware              6              24 / 40 / 64

To request nodes with updated Infiniband firmware for your MPI-based jobs:

#SBATCH --constraint=mlx5_ib

To request nodes with a large local scratch storage:

#SBATCH --constraint=nvme
or
#SBATCH --constraint=fastscratch

See Scratch storage for details.

To request nodes with CPUs capable of AVX512 instructions:

#SBATCH --constraint=avx512

2.5.3. Scratch storage

Scratch storage is local to each compute node and provides the fastest disk access for reading and writing job input/output files.

Scratch storage is created for each job automatically as /tmp/ucinetid/jobid/ when the job starts on a compute node. Slurm knows this location and refers to it through the environment variable $TMPDIR. Users don’t need to create $TMPDIR; they simply use it in their submit scripts.

For example, a user panteater who has 2 running jobs:

[user@login-x:~]$ squeue -u panteater
JOBID     PARTITION      NAME      USER  ACCOUNT ST      TIME CPUS NODE NODELIST(REASON)
20960254   standard  test-001 panteater   PI_lab  R   1:41:12   25    1 hpc3-15-08
20889321   standard  test-008 panteater   PI_lab  R  17:24:10   20    1 hpc3-15-08

will have the following directories created by Slurm for the jobs on hpc3-15-08:

/tmp/panteater/20960254
/tmp/panteater/20889321

Note

While the directory is created automatically, it is the USER’S RESPONSIBILITY to copy files to this location and to copy the final results back before the job ends.

Slurm doesn’t define a default amount of scratch space per job. That is fine for most jobs, but not all. Running out of local scratch becomes a problem when nodes are shared by multiple jobs and users: one job can cause other jobs running on the same node to fail. Please be considerate of your colleagues and follow the guidance below that matches your job type:

  1. Your job creates a few GB of temporary data directly in $TMPDIR and handles the automatic creation and deletion of these temp files. Many Python, Perl, R, and Java programs and third-party commercial software write to $TMPDIR, which is the default for many applications.

    You don’t need to do anything special. Do not reset $TMPDIR.

  2. Your job creates a few Gb of output in the directory where you run the job and does many frequent small file reads or writes (a few Kb every few minutes).

    You will need to use a scratch storage where you bring your job data, write temp files and then copy the final output files back when the job is done.

    Attention

    The reason is that a parallel filesystem (CRSP or DFS) is not suitable for small reads and writes, so such operations need to be off-loaded to the local scratch area on the node where the job is executed. Otherwise you create an I/O problem not just for yourself but for many others who use the same filesystem.

    The following partial submit script shows how to use $TMPDIR for such jobs:

    <the rest of submit script is omitted>
    
    #SBATCH --tmp=20G                 # requesting 20 GB (1 GB = 1,024 MB) local scratch
    
    # explicitly copy input files from DFS/CRSP to $TMPDIR
    # note, $TMPDIR is already created for your job by SLURM
    cd $TMPDIR
    cp /pub/myaccount/path/to/my/jobs/data/*dbfiles  $TMPDIR
    
    # create a directory for the application output
    mkdir -p $TMPDIR/output
    
    # your job commands, this is just one possible example
    # output from application goes to $TMPDIR/output/
    mapp -tf 45 -o $TMPDIR/output     # program output directory is specified via -o flag
    mapp2  > $TMPDIR/output/mapp.out  # program output in a specific file
    
    # explicitly copy output files from $TMPDIR to DFS/CRSP
    mv $TMPDIR/output/* /pub/myaccount/myDesiredDir/
    

    In this scenario, the Slurm job runs in $TMPDIR, which is much faster for disk I/O. The program output is then copied back as one big write, which is much more efficient than many small writes.

  3. Your job creates a large amount of temporary data (on the order of 100 GB)

    You will need to submit your job to a node with a lot of local scratch storage where you bring your job data, write temp files, and then copy the final output files back when the job is done.

    In your submit script, define how much scratch space your job needs (you may need to determine it with a trial run) and request nodes that have a fast local scratch area via the following Slurm directives:

    #SBATCH --tmp=180G                 # requesting 180 GB (1 GB = 1,024 MB) local scratch
    #SBATCH --constraint=fastscratch   # requesting nodes with a lot of space in /tmp
    

    Follow the above (job type 2) submit script example to:

    - at job start explicitly copy input files from DFS/CRSP to $TMPDIR
    - at job end explicitly copy output files from $TMPDIR to DFS/CRSP

2.5.4. Memory

There are nodes with different memory footprints. Slurm uses Linux cgroups to enforce that applications do not use more memory/cores than they have been allocated.

Slurm has default and max settings for the memory allocation per core for each partition. Please see the settings for all partitions in Available CPU partitions.

default settings:

Are used when a job submission script does not specify different memory allocation, and for most jobs this is sufficient.

max settings:

Are used when a job requires more memory. Job memory specifications can not exceed the partition’s max setting. If a job specifies a memory per CPU limit that exceeds the system limit, the job’s count of CPUs per task will automatically be increased. This may result in the job failing due to CPU count limits.

Note

Please do not override the memory defaults unless your particular job really requires it. Analysis of more than 3 million jobs on HPC3 indicated that more than 98% of jobs fit within the defaults. With slightly smaller memory footprints, the scheduler has MORE choices as to where to place jobs on the cluster.

Note

For information on how to get access to higher memory partitions, please see Higher Memory.

When a job requires more memory:

the memory needs to be specified using one of two mutually exclusive directives (one or the other, but not both):

--mem-per-cpu=X<specification> - memory per core
--mem=X<specification> - total memory per job

where X is an integer and <specification> is an optional size suffix (M for megabytes, G for gigabytes, T for terabytes). The default unit is megabytes.

The same directive formats are used in Slurm submit scripts and in srun commands for jobs in any partition.

If you want more memory for the job, you should do one of the following:

Scenario 1:
- ask for more total memory

Scenario 2:
- ask for max memory per core and, if this is not enough,
- request more cores.

You will be charged more for more cores, but you will also use a larger fraction of the node.

Examples of memory requests:

  1. Ask for the total job memory in submit script

    #SBATCH --mem=500           # requesting 500MB memory for the job
    #SBATCH --mem=4G            # requesting 4GB (1GB = 1,024MB) for the job
    
  2. Ask for the memory per CPU in submit script

    #SBATCH --mem-per-cpu=5000  # requesting 5000MB memory per CPU
    #SBATCH --mem-per-cpu=2G    # requesting 2GB memory per CPU
    
  3. Ask for 180 Gb for job in standard partition:

    #SBATCH --partition=standard
    #SBATCH --mem-per-cpu=6G    # requesting max memory per CPU
    #SBATCH --ntasks=30         # requesting 30 CPUs
    

    Ask for the max memory per CPU and enough CPUs to make up the total memory needed for the job: 30 x 6 GB = 180 GB

  4. Use srun and request 2 CPUs with a default or max memory

    [user@login-x:~]$ srun -p free --nodes=1 --ntasks=2 --pty /bin/bash -i
    [user@login-x:~]$ srun -p free --nodes=1 --ntasks=2 --mem-per-cpu=18G --pty /bin/bash -i
    [user@login-x:~]$ srun -p free --nodes=1 --ntasks=2 --mem=36G --pty /bin/bash -i
    
    The first job will have a total memory of 2 x 3 GB = 6 GB.
    The second and third jobs will each have a total memory of 2 x 18 GB = 36 GB.

  5. Use srun and request 4 CPUs and 10 GB of memory per CPU:

    [user@login-x:~]$ srun -p free --nodes=1 --ntasks=4 --mem-per-cpu=10G --pty /bin/bash -i
    

    The total memory for the job is 4 x 10 GB = 40 GB.
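The arithmetic in the examples above can be sketched as two small shell helpers. This is only an illustration: the function names total_mem_gb and cpus_for_mem are hypothetical, and the 6 GB per-CPU maximum is taken from the standard partition example above, so check the actual partition settings before relying on it.

```shell
# Total job memory in GB when memory is requested per CPU:
#   total = number of CPUs x memory per CPU
total_mem_gb() (
    cpus="$1"
    mem_per_cpu_gb="$2"
    echo $(( cpus * mem_per_cpu_gb ))
)

# Smallest number of CPUs whose combined per-CPU memory covers the
# needed total (integer ceiling division).
cpus_for_mem() (
    needed_gb="$1"
    max_per_cpu_gb="$2"
    echo $(( (needed_gb + max_per_cpu_gb - 1) / max_per_cpu_gb ))
)

total_mem_gb 30 6     # prints 180
total_mem_gb 2 18     # prints 36
cpus_for_mem 180 6    # prints 30
```

The second helper mirrors Scenario 2: when the per-CPU maximum is not enough, the remaining memory has to come from additional CPUs.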

2.5.5. Runtime

Similar to memory limits, Slurm has default and max runtime settings for each partition. Please see the settings for all partitions in Available CPU partitions.

Important

All interactive jobs submitted with srun command and all batch jobs submitted with sbatch command have time limits whether you explicitly set them or not.

default settings:

are used when a job submission script or srun command does not specify a different runtime, and for most jobs this is sufficient.

max settings:

specify the longest time a job can run in a given partition. Job time specifications can not exceed the partition’s max setting.

When a job requires a longer runtime than the default, it needs to be specified using the Slurm time directive --time=format (or the equivalent short notation -t format).

Acceptable time formats are:
minutes
minutes:seconds
hours:minutes:seconds
days-hours
days-hours:minutes
days-hours:minutes:seconds

For example, for Slurm script:

#SBATCH --time=5        # 5 minutes
#SBATCH -t 36:30:00     # 36 hrs and 30 min
#SBATCH -t 7-00:00:00   # 7 days

Similarly, for srun command:

srun --time=10 <other arguments>      # 10 minutes
srun -t 15:00:00  <other arguments>   # 15 hours
srun -t 5-00:00:00 <other arguments>  # 5 days

If your job was submitted with the default time limit and you realize it will not finish before that limit is reached, you can ask for a runtime extension (not available for free jobs). Please see change job time limit.
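To check what a time specification amounts to before submitting, the six formats listed above can be converted to seconds. The helper below is a rough sketch with a hypothetical name (slurm_time_to_seconds); it only handles the formats listed in this section and does no further validation.

```shell
# Convert a Slurm time specification to seconds. Supported formats:
#   minutes, minutes:seconds, hours:minutes:seconds,
#   days-hours, days-hours:minutes, days-hours:minutes:seconds
slurm_time_to_seconds() {
    awk -v spec="$1" 'BEGIN {
        days = 0; h = 0; m = 0; s = 0
        rest = spec
        dash = index(spec, "-")
        if (dash) {                       # a "days-" prefix is present
            days = substr(spec, 1, dash - 1)
            rest = substr(spec, dash + 1)
        }
        n = split(rest, f, ":")
        if (dash) {                       # days-hours[:minutes[:seconds]]
            h = f[1]
            if (n >= 2) m = f[2]
            if (n >= 3) s = f[3]
        } else if (n == 1) {              # minutes
            m = f[1]
        } else if (n == 2) {              # minutes:seconds
            m = f[1]; s = f[2]
        } else if (n == 3) {              # hours:minutes:seconds
            h = f[1]; m = f[2]; s = f[3]
        } else {
            exit 1                        # unsupported format
        }
        total = ((days * 24 + h) * 60 + m) * 60 + s
        print total
    }'
}

slurm_time_to_seconds 5            # prints 300     (5 minutes)
slurm_time_to_seconds 36:30:00     # prints 131400  (36 hrs and 30 min)
slurm_time_to_seconds 7-00:00:00   # prints 604800  (7 days)
```

Note that a bare number is read as minutes, not seconds or hours, which is an easy mistake to make when writing --time=10.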

2.5.6. Mail notification

To receive email notification on the status of jobs, include the following lines in your submit scripts and make the appropriate modifications to the second line:

#SBATCH --mail-type=fail,end
#SBATCH --mail-user=user@domain.com

The first line specifies the event types for which a user requests an email (here failure/end events); the second specifies a valid email address. We suggest using very few event types, especially if you submit hundreds of jobs. For more info, see the output of the man sbatch command.

Make sure to use your actual UCI-issued email address. While Slurm sends emails to any email address, we prefer you use your UCInetID@uci.edu email address. System administrators will use UCInetID@uci.edu if they need to contact you about a job.

Attention

DO NOT use the mail event types ALL or BEGIN.
DO NOT enable email notification if you submit hundreds of jobs.
Sending an email for each job overloads the Postfix server.

2.6. Monitoring

2.6.1. Status

squeue
scontrol show job

To check the status of your job in the queue:
[user@login-x:~]$ squeue -u panteater
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
22877983  standard     test panteater R       0:03      1 hpc3-17-11

Attention

AVOID using command watch to query the Slurm queue in a continuous loop as in this command: watch -d squeue <...some arguments...>

Such frequent querying of the Slurm queue creates unnecessary overhead and affects many users. Instead, check your job output and use Mail notification for the job end.

To get detailed info about the job:
[user@login-x:~]$ scontrol show job 22877983

The output will contain a list of key=value pairs that provide job information.

2.6.2. Account balance

sbank
zotledger

In order to run jobs on HPC3, a user must have available CPU hours.

  1. sbank is short for Slurm Bank. It displays the balance of used and available hours for a given account (defaults to the current user).

    Display the account balance for specific account:
    [user@login-x:~]$ sbank balance statement -a panteater
    User         Usage |     Account   Usage | Account Limit Available (CPU hrs)
    ---------- ------- + ----------- ------- + ------------- ---------
    panteater*      58 |   PANTEATER      58 |         1,000       942
    
    Display the account balances for specific user:
    [user@login-x:~]$ sbank balance statement -u panteater
    User        Usage |     Account    Usage | Account Limit Available (CPU hrs)
    ---------- ------ + ----------- -------- + ------------- ---------
    panteater*     58 |   PANTEATER       58 |         1,000       942
    panteater*  6,898 |      PI_LAB    6,898 |       100,000    93,102
    panteater*     84 | PANTEATER_LAB_GPU 84 |        33,000    32,916
    

    Note

    Each GPU requires at least 2 CPU cores. Hence, the minimum charge for one hour of a single GPU is (32 + 2) = 34 SUs: 32 SUs for the GPU plus 1 SU for each of the 2 CPU cores.

  2. We have a cluster-specific tool to print a ledger of jobs based on specified arguments.

    Default is to print jobs of the current user for the last 30 days:
    [user@login-x:~]$ zotledger -u panteater
          DATE       USER   ACCOUNT PARTITION   JOBID JOBNAME ARRAYLEN CPUS WALLHOURS  SUs
    2021-07-21  panteater panteater  standard 1740043    srun        -    1      0.00 0.00
    2021-07-21  panteater panteater  standard 1740054    bash        -    1      0.00 0.00
    2021-08-03  panteater    lab021  standard 1406123    srun        -    1      0.05 0.05
    2021-08-03  panteater    lab021  standard 1406130    srun        -    4      0.01 0.02
    2021-08-03  panteater    lab021  standard 1406131    srun        -    4      0.01 0.02
        TOTALS          -         -         -       -       -        -    -      0.07 0.09
    
    To find all available arguments for this command use:
    [user@login-x:~]$ zotledger -h
    

2.6.3. Efficiency

sacct
seff
sstat

These are commands that provide info about resources consumed by the job.

use for running jobs:

sstat

use after the job completes:

sacct, seff

All commands need to use a valid jobid.

  1. sstat displays resource utilization information for running jobs and job steps.

    For example, to print a job’s average CPU time (AveCPU), average number of bytes written by all tasks (AveDiskWrite), average number of bytes read by all tasks (AveDiskRead), and the total number of tasks (NTasks), run:

    [user@login-x:~]$ sstat -j 125610 --format=jobid,avecpu,aveDiskWrite,AveDiskRead,ntasks
           JobID     AveCPU AveDiskWrite  AveDiskRead   NTasks
    ------------ ---------- ------------ ------------ --------
    125610.batch 10-18:11:+ 139983973691 153840335902        1
    
  2. The sacct command can be used to see accounting data for all jobs and job steps, along with other useful info such as how long a job waited in the queue.

    Find accounting info about a specific job:
    [user@login-x:~]$ sacct -j 43223
           JobID  JobName  Partition      Account  AllocCPUS      State ExitCode
    ------------ -------- ---------- ------------ ---------- ---------- --------
       36811_374    array   standard panteater_l+          1  COMPLETED      0:0
    

    The command uses a default output format. A more useful example will set a specific format that provides extra information.

    Find detailed accounting info for a job using a specific format:
    [user@login-x:~]$ export SACCT_FORMAT="JobID,JobName,Partition,Elapsed,State,MaxRSS,AllocTRES%32"
    [user@login-x:~]$ sacct -j 600
    JobID      JobName  Partition  Elapsed     State  MaxRSS AllocTRES
    ---------- -------  --------  -------- --------- ------- --------------------------------
           600    all1  free-gpu  03:14:42 COMPLETED         billing=2,cpu=2,gres/gpu=1,mem=+
     600.batch   batch            03:14:42 COMPLETED 553856K           cpu=2,mem=6000M,node=1
    600.extern  extern            03:14:42 COMPLETED       0 billing=2,cpu=2,gres/gpu=1,mem=+
    
    • MaxRSS: shows your job memory usage.

    • AllocTRES: shows the trackable resources (TRES) allocated to the job once it started running. The %32 is a format specification that reserves 32 characters for this field in the output. A format specification can be used with any field.

    Find how long your jobs were queued (column Planned) before they started running:
    [user@login-x:~]$ export SACCT_FORMAT='JobID%20,Submit,Start,Elapsed,Planned'
    [user@login-x:~]$ sacct -j 30054126,30072212,30072182 -X
           JobID              Submit               Start    Elapsed    Planned
    ------------ ------------------- ------------------- ---------- ----------
        30054126 2024-07-14T11:17:00 2024-07-14T17:03:08   00:22:09   05:46:08
        30072182 2024-07-14T20:29:30 2024-07-14T20:31:16   00:05:20   00:01:46
        30072212 2024-07-14T20:44:14 2024-07-14T20:44:26   00:05:58   00:00:12
    

    Note

    Other useful options in SACCT_FORMAT are User, NodeList, ExitCode. To see all available options, run the man sacct command.

  3. The seff Slurm efficiency script is used to find useful information about the job including the memory and CPU use and efficiency. Note, seff doesn’t produce accurate results for multi-node jobs. Use this command for single node jobs.

    [user@login-x:~]$ seff -j 423438
    Job ID: 423438
    Cluster: hpc3
    User/Group: panteater/panteater
    State: COMPLETED (exit code 0)
    Nodes: 1
    Cores per node: 8
    CPU Utilized: 00:37:34
    CPU Efficiency: 12.21% of 05:07:36 core-walltime
    Job Wall-clock time: 00:38:27
    Memory Utilized: 2.90 MB
    Memory Efficiency: 0.01% of 24.00 GB
    

    Important info is on CPU and Memory lines.

    CPU efficiency:

    at 12.21% the job used only a small portion of the requested 8 CPUs

    Memory efficiency:

    at 0.01% the job used only a tiny fraction of the requested 24 GB of memory

    The user should fix the job submit script and ask for less memory per CPU and for fewer CPUs.
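The two efficiency figures in the seff output above can be reproduced by hand, which helps when deciding how far to reduce a request. The sketch below assumes CPU efficiency is CPU time used divided by (cores x wall-clock time) and memory efficiency is memory used divided by memory requested. This matches the sample numbers but is not taken from the seff source, and the helper names cpu_efficiency and mem_efficiency are hypothetical.

```shell
# CPU efficiency in percent.
#   $1 = CPU time used (HH:MM:SS), $2 = wall-clock time (HH:MM:SS), $3 = cores
cpu_efficiency() {
    awk -v used="$1" -v wall="$2" -v cores="$3" 'BEGIN {
        split(used, u, ":"); split(wall, w, ":")
        used_s = u[1] * 3600 + u[2] * 60 + u[3]
        wall_s = w[1] * 3600 + w[2] * 60 + w[3]
        # core-walltime is the wall-clock time multiplied by the core count
        printf "%.2f\n", 100 * used_s / (cores * wall_s)
    }'
}

# Memory efficiency in percent.
#   $1 = memory used in MB, $2 = memory requested in GB (1 GB = 1,024 MB)
mem_efficiency() {
    awk -v used_mb="$1" -v req_gb="$2" 'BEGIN {
        printf "%.2f\n", 100 * used_mb / (req_gb * 1024)
    }'
}

cpu_efficiency 00:37:34 00:38:27 8   # prints 12.21
mem_efficiency 2.90 24               # prints 0.01
```

Here 8 cores x 00:38:27 gives the 05:07:36 core-walltime shown by seff, of which only 00:37:34 was actually used.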

2.7. Pending

Jobs submitted to Slurm will start up as soon as the scheduler can find an appropriate resource depending on the availability of the nodes, job priority and job resources.

Lack of resources or insufficient account balance (status reason is AssocGrpCPUMinutesLimit or AssocGrpBillingMinutes) are the most common reasons that prevent a job from starting.

RCIC does not generally put limits in place unless we see excess, unreasonable impact to shared resources (often, file systems), or other fairness issues.

When a job is in PD (pending) status you need to determine why.

Important

The balance in the account must have enough core hours to cover the job request.

  • This applies to all jobs submitted with sbatch or srun.

  • This applies to ALL partitions, including free. While your job will not be charged when submitted to a free partition, there must be a sufficient balance for Slurm to begin your job.

2.7.1. Pending Job Reasons

While lack of resources or insufficient account balance are common reasons that prevent a job from starting, there are other possibilities.

To see why your jobs are pending, run the squeue command with your user name:

[user@login-x:~]$ squeue -t PD -u peat
JOBID PARTITION NAME USER ACCOUNT ST TIME CPUS NODE NODELIST(REASON)
92005 standard  watA peat   p_lab PD 0:00    1    1 (ReqNodeNotAvail,Reserved for maintenance)
92008 standard  watA peat   p_lab PD 0:00    1    1 (ReqNodeNotAvail,Reserved for maintenance)
92011 standard  watA peat   p_lab PD 0:00    1    1 (ReqNodeNotAvail,Reserved for maintenance)
95475 free-gpu  7sMD peat   p_lab PD 0:00    2    1 (QOSMaxJobsPerUserLimit)
95476 free-gpu  7sMD peat   p_lab PD 0:00    2    1 (QOSMaxJobsPerUserLimit)

The most common reasons for a pending job state and their explanations are summarized below.

AssocGrpCPUMinutesLimit:

Insufficient funds are available to run the job to completion. Slurm uses the MAX time a job might consume, calculated as Number of cores x Number of hours requested for the job, and internally marks those hours as unavailable.

AssocGrpBillingMinutes:

Same as AssocGrpCPUMinutesLimit above.

Dependency:

Job has a user-defined dependency on a running job and cannot start until the previous job has completed.

DependencyNeverSatisfied:

Job has a user-defined dependency that failed. Job will never run and needs to be canceled.

Priority:

Slurm’s scheduler is temporarily holding the job in pending state because other queued jobs have a higher priority.

QOSMaxJobsPerUserLimit:

The user is already running the maximum number of jobs allowed by the particular partition.

ReqNodeNotAvail, Reserved for maintenance:

If the job were to run for the requested maximum time, it would run into a defined maintenance window. Job will not start until maintenance has been completed.

Resources:

The requested resource configuration is not currently available. If a job requests a resource combination that physically does not exist, the job will remain in this state forever.

To see all available job pending reasons and their definitions, please see the JOB REASON CODES section in the output of the man squeue command. A job may be waiting for more than one reason.
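When many jobs are pending, a tally of the reasons gives a quicker overview than the full listing. The helper below is a sketch with a hypothetical name (count_pending_reasons); it reads one reason per line on standard input, so it can be tried on sample text without access to the scheduler.

```shell
# Count how often each pending reason occurs.
# Reads one reason per line on stdin; prints "count reason",
# most frequent reason first. (Hypothetical helper.)
count_pending_reasons() {
    awk 'NF { n[$0]++ } END { for (r in n) printf "%d %s\n", n[r], r }' |
        sort -k1,1nr -k2
}

# Example using the reasons from the sample squeue output above.
count_pending_reasons <<'EOF'
ReqNodeNotAvail,Reserved for maintenance
ReqNodeNotAvail,Reserved for maintenance
ReqNodeNotAvail,Reserved for maintenance
QOSMaxJobsPerUserLimit
QOSMaxJobsPerUserLimit
EOF
```

On the cluster, the input could come from squeue itself, for example squeue -h -t PD -u $USER -o "%r", where -h suppresses the header and %r prints the reason field.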

2.7.2. Pending job due to Maintenance

If your job is pending with the reason ReqNodeNotAvail, Reserved for maintenance, it means that if your job started now, it would not complete before the scheduled maintenance starts. Slurm is holding your job because no jobs can run during the maintenance period. You either have to wait or change your job requirements. See fix pending jobs.

2.7.3. Pending job due to Resources

If your job is in pending status due to the reason Resources it means the resources you requested are currently busy.

  1. Check Slurm estimate for the job start time:

    [user@login-x:~]$ squeue --start -j 325111
    JOBID  PARTITION NAME      USER ST          START_TIME NODES SCHEDNODES  NODELIST(REASON)
    325111      free  GEN panteater PD 2024-08-15T13:36:57     1 hpc3-14-00  (Resources)
    
    The estimated start time is listed under START_TIME.
    You either have to wait or you need to change your job requirements. See fix pending jobs.

2.7.4. Pending job in personal account

If your job is in pending status due to the reason AssocGrpCPUMinutesLimit or AssocGrpBillingMinutes it means there is not enough balance left in your personal account to run your job. The checks are the same for both.

  1. Check your jobs status:

    [user@login-x:~]$ squeue -u panteater
    JOBID   PARTITION  NAME     USER    ACCOUNT ST TIME NODES NODELIST(REASON)
    1666961  standard  tst1 panteater panteater PD 0:00     1 (AssocGrpCPUMinutesLimit)
    1666962  standard  tst2 panteater panteater PD 0:00     1 (AssocGrpCPUMinutesLimit)
    

    Note, the reason is AssocGrpCPUMinutesLimit which means there is not enough balance left in the account. The job was submitted to use a personal account.

  2. Check your Slurm account balance

    [user@login-x:~]$ sbank balance statement -u panteater
    User        Usage |     Account   Usage | Account Limit Available (CPU hrs)
    ---------- ------ + ----------- ------- + ------------- ---------
    panteater*     58 |   PANTEATER      58 |         1,000       942
    panteater*  6,898 |      PI_LAB   6,898 |       100,000    93,102
    

    The personal account has 942 CPU hours available.

  3. Check your job requirements

    You can use scontrol show job <jobid> or a command below

    [user@login-x:~]$ squeue -o "%i %u %j %C %T %L %R" -p standard -t PD -u panteater
    JOBID        USER NAME CPUS   STATE     TIME_LEFT  NODELIST(REASON)
    1666961 panteater tst1  16  PENDING    3-00:00:00 (AssocGrpCPUMinutesLimit)
    1666962 panteater tst2  16  PENDING    3-00:00:00 (AssocGrpCPUMinutesLimit)
    

    Each job asks for 16 CPUs to run for 3 days, which is

    \(16 * 24 * 3 = 1152\) core-hours, more than the 942 hours in the account balance.

    Attention

    These jobs will never be scheduled to run and need to be canceled.
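The check in step 3 can be scripted. The following is a minimal sketch, not an official tool: the function names are illustrative, and the time limit is assumed to be in one of the forms squeue prints (D-HH:MM:SS, HH:MM:SS or MM:SS).

```shell
#!/usr/bin/env bash
# Convert a Slurm time string (D-HH:MM:SS, HH:MM:SS or MM:SS) to seconds.
# Other input forms accepted by sbatch (e.g. "days-hours") are not handled.
slurm_time_to_seconds() {
    local t=$1 d=0 h=0 m=0 s=0
    if [[ $t == *-* ]]; then
        d=${t%%-*}      # part before the dash is the day count
        t=${t#*-}
    fi
    local IFS=:
    local parts
    read -r -a parts <<< "$t"
    case ${#parts[@]} in
        3) h=${parts[0]} m=${parts[1]} s=${parts[2]} ;;
        2) m=${parts[0]} s=${parts[1]} ;;
        *) echo "unsupported time format: $1" >&2; return 1 ;;
    esac
    # 10# forces base 10 so that values like "08" are not read as octal
    echo $(( 10#$d * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Maximum core-hours a job can consume: CPUs x time limit, rounded up.
core_hours() {
    local cpus=$1 secs
    secs=$(slurm_time_to_seconds "$2") || return 1
    echo $(( (cpus * secs + 3599) / 3600 ))
}

# Values from the example above: 16 CPUs, 3-day limit, 942 hours available.
need=$(core_hours 16 3-00:00:00)
balance=942
if (( need > balance )); then
    echo "job needs $need core-hours, account has $balance"
fi
```

With the values above the script reports 1152 core-hours needed against 942 available, matching the manual calculation.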

2.7.5. Pending job in LAB account

If your job is in pending status due to the reason AssocGrpCPUMinutesLimit or AssocGrpBillingMinutes it means there is not enough balance left in your lab account to run your job. The checks are the same for both.

Important

A lab account has a combined single balance and thus a single limit for all members of the lab.

Slurm will not start a new job if max time left of current jobs plus max time of queued jobs would cause the account to go negative.

Note

A user needs to check whether other jobs are already running in the specified account and compute how much time Slurm has already allocated to all jobs on the LAB account.

  1. Check your jobs status

    [user@login-x:~]$ squeue -u panteater -t PD
    JOBID     PARTITION     NAME      USER ACCOUNT ST  TIME CPUS NODE NODELIST(REASON)
    12341501  standard  myjob_98 panteater  PI_lab PD  0:00    1    1 (AssocGrpCPUMinutesLimit)
    12341502  standard  myjob_99 panteater  PI_lab PD  0:00    1    1 (AssocGrpCPUMinutesLimit)
    
  2. Check the Slurm lab account balance

    [user@login-x:~]$ sbank balance statement -a PI_LAB
    User         Usage |  Account   Usage | Account Limit Available (CPU hrs)
    ---------- ------- + ----------- -----+ ------------- ---------
    panteater1       0 |   PI_LAB  75,800 |       225,000   67,300
    panteater2  50,264 |   PI_LAB  75,800 |       225,000   67,300
    panteater*  25,301 |   PI_LAB  75,800 |       225,000   67,300
    
  3. Check your job requirements

    [user@login-x:~]$ scontrol show job 12341501
    JobId=12341501 JobName=myjob_98
       UserId=panteater(1234567) GroupId=panteater(1234567) MCS_label=N/A
       Priority=299 Nice=0 Account=PI_lab QOS=normal
       JobState=PENDING Reason=AssocGrpCPUMinutesLimit Dependency=(null)
       Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=00:00:00 TimeLimit=14-00:00:00 TimeMin=N/A
       SubmitTime=2023-01-18T16:36:06 EligibleTime=2023-01-18T16:36:06
       AccrueTime=2023-01-18T16:36:06
       StartTime=Unknown EndTime=Unknown Deadline=N/A
       NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=1,mem=6G,node=1,billing=1
       <output  cut>
    
    The output for the second job is similar. Note the TimeLimit.
    For each of the two pending jobs the resource request is:
    \(1 CPU * 14 days * 24 hrs = 336 hrs\)
  4. Check ALL the running jobs for your lab account

    [user@login-x:~]$ squeue -t R -A PI_lab -o "%.10i %.9P %.8j %.8u %.16a %.2t %.6C %l %L"
       JOBID PARTITION     NAME     USER       ACCOUNT ST   CPUS  TIME_LIMIT TIME_LEFT
    12341046  standard myjob_39  panteater      PI_lab  R      1 14-00:00:00 13-23:00:22
    12341047  standard myjob_40  panteater      PI_lab  R      1 14-00:00:00 13-23:00:22
    12341048  standard myjob_41  panteater      PI_lab  R      1 14-00:00:00 13-23:00:22
    < total 200 lines for 200 jobs >
    
    Each of the 200 running jobs in the account has run for about 1 hr out of the allocated 14 days.
    The total max time Slurm has allocated for these running jobs is
    \(1 CPU * 200 jobs * 14 days * 24 hrs = 67200 hrs\)
    About 200 hrs are already used (each job has already run for ~1 hr), so the remaining needed
    balance is 67000 hrs. Per step 3 above, your 2 pending jobs require
    \(1 CPU * 14 days * 24 hrs * 2 jobs = 672 hrs\).
    Slurm computes that if all current jobs ran to their MAX times and if the pending jobs
    were to run MAX time, your account would end up negative:
    \(67300 - 67000 - 672 = -372 hrs\).

    Therefore Slurm puts these new jobs on hold. These 2 jobs will start running once some of the running jobs have completed and the account balance is sufficient.
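The bookkeeping in step 4 can be approximated with a short script. This is a sketch under stated assumptions, not how Slurm itself does the accounting: the helper names sum_core_hours and projected_balance are hypothetical, input lines are assumed to have the form "<cpus> <time>" with the time as printed by squeue, and the numbers in the usage lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Sum CPUs x time over stdin lines of the form "<cpus> <D-HH:MM:SS>" and print
# the total in whole core-hours (rounded up). Values such as UNLIMITED or N/A
# are not handled.
sum_core_hours() {
    awk '
    {
        days = 0; t = $2
        if (index(t, "-") > 0) { split(t, dt, "-"); days = dt[1]; t = dt[2] }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1] * 60        # bare minutes
        total += $1 * (days * 86400 + secs)
    }
    END { printf "%d\n", (total + 3599) / 3600 }'
}

# projected_balance AVAILABLE RUNNING_LEFT PENDING_MAX
# A negative result means the pending jobs cannot start yet.
projected_balance() {
    echo $(( $1 - $2 - $3 ))
}

# Made-up input: two running jobs and one pending job, 400 hours available.
running=$(printf '%s\n' "4 1-00:00:00" "2 12:00:00" | sum_core_hours)  # 96 + 24 = 120
pending=$(printf '%s\n' "8 2-00:00:00" | sum_core_hours)               # 8 * 48 = 384
projected_balance 400 "$running" "$pending"                            # prints -104
```

On the cluster, the running total could be fed from squeue -h -t R -A PI_lab -o "%C %L" and the pending total from squeue -h -t PD -A PI_lab -o "%C %l", with the available hours taken from the sbank output.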

Important

It is important to correctly estimate time needed for the job, and not ask for more resources (time, cpu, memory) than needed.

2.7.6. Fix pending job

Fixes apply for batch jobs submitted with sbatch or for interactive jobs submitted with srun.

You will need to cancel the existing pending job (it will never run):

[user@login-x:~]$ scancel <jobid>

Next, resubmit the job so that the requested execution hours can be covered by your bank account balance. Check and update the following in your submit script:

  1. If your job was run from a personal account

    use a different Slurm account (lab) where you have enough balance
    #SBATCH -A
  2. Lower the requirements of your job so that the requested resources are no more than the core hours available in your account. This may mean using:

    fewer CPUs
    #SBATCH --ntasks or #SBATCH --cpus-per-task
    fewer CPUs but with increased memory per CPU
    #SBATCH --ntasks and #SBATCH --mem-per-cpu
    less memory
    #SBATCH --mem or #SBATCH --mem-per-cpu
    set a time limit that is shorter than the default runtime
    #SBATCH --time
  3. If your job is pending due to ReqNodeNotAvail, Reserved for maintenance you need to re-submit your job with a shorter time limit that will end BEFORE the maintenance begins.

    To find out the reservation details use:

    [user@login-x:~]$ scontrol show reservation
    ReservationName=RCIC: HPC3 scheduled maintenance StartTime=2024-03-27T08:00:00 EndTime=2024-03-28T08:00:00 Duration=1-00:00:00
       Nodes=hpc3-14-[00-31],... NodeCnt=228 CoreCnt=9936 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
       TRES=cpu=9936
       Users=root,... Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
       MaxStartDelay=(null)
    

    The first output line includes the maintenance start time, end time and duration.

    Based on the info about the reservation and the current day/time you can estimate what time limit #SBATCH --time should be specified for your job in order for it to finish before the maintenance starts.

    If your job truly needs the requested time limit, nothing can be done until the maintenance is over. Remove your job from the queue and resubmit after the maintenance.

    If you did not specify a time limit, the default time setting is in effect.

Please see Available CPU partitions for partition default and max settings and Job examples for additional info.
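For item 3 above, the longest usable time limit can be estimated from the reservation StartTime. The following is a minimal sketch, not an RCIC tool: the function name max_time_limit and the one-hour safety margin are assumptions, and GNU date is assumed for converting timestamps to epoch seconds.

```shell
#!/usr/bin/env bash
# Print the longest --time value (D-HH:MM:SS) that still ends before the
# maintenance starts, minus a safety margin in seconds (default 3600).
# Arguments: START_EPOCH NOW_EPOCH [MARGIN_SECONDS]
max_time_limit() {
    local start=$1 now=$2 margin=${3:-3600}
    local left=$(( start - now - margin ))
    if (( left <= 0 )); then
        echo "no time left before maintenance" >&2
        return 1
    fi
    printf '%d-%02d:%02d:%02d\n' \
        $(( left / 86400 )) $(( left % 86400 / 3600 )) \
        $(( left % 3600 / 60 )) $(( left % 60 ))
}

# StartTime copied from the reservation output above.
start_time="2024-03-27T08:00:00"
start=$(date -d "${start_time/T/ }" +%s)
# A fixed "now" two days earlier is used for illustration; use $(date +%s) in practice.
now=$(date -d "2024-03-25 08:00:00" +%s)
max_time_limit "$start" "$now"      # prints 1-23:00:00
```

The printed value can be used as the argument of #SBATCH --time when resubmitting, provided the job can actually finish within that limit.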

2.8. Modification

It is possible to make some changes to jobs that are still waiting to run by using the scontrol command.

If changes need to be made for a running job, it may be better to kill the job and restart it after making the necessary changes.

Change job time limit:

If your job is already running and you have established that it will not finish by its current time limit, you can request an extension of the time limit.

Only the Slurm administrator can increase a job’s time limit, therefore you need to submit a ticket indicating:

  • your jobid

  • your desired time extension

Note, we need to receive your request before your job’s current end time and your bank account must have sufficient funds to cover the desired time extension.

Change QOS:

By default, jobs are set to run with qos=normal. Users rarely need to change QOS.

[user@login-x:~]$ scontrol update jobid=<jobid> qos=[low|normal|high]

2.9. Cancel/Hold/Release

The following commands can be used to:

Cancel a specific job:
[user@login-x:~]$ scancel <jobid>
Cancel all jobs owned by a user:

This only applies to jobs that are associated with your accounts.

[user@login-x:~]$ scancel -u <username>
Prevent a pending job from starting:
[user@login-x:~]$ scontrol hold <jobid>
Release held jobs to run:
[user@login-x:~]$ scontrol release <jobid>