1. Slurm primer

Slurm is an open-source cluster management and job scheduling system for Linux clusters large and small, and the job scheduler of choice on the NeSI HPC cluster.

Whenever you want to run jobs on the cluster, you submit them to a job queue. Jobs in the queue are executed based on their priority, which depends on the hardware resources your job requests. If you request a lot of resources, your job will get a low priority and will likely have to wait a while before it is executed.

To get a good priority, you need to know some particulars about your individual jobs. The most important pieces of information for each job are:

  • How long the job will likely run (wall time)
  • The number of CPUs the job needs
  • The amount of memory the job needs

Keeping your estimates as close as possible to the real requirements is essential for getting a good priority in the queue.

Note

Often you do not know in advance how long your jobs need to run. In this case it is good to schedule only one or a few jobs first and submit them with more resources than they will likely use, e.g. 10 minutes instead of the real 2 minutes. Once a job is done, check how long it actually took; then you can submit all jobs with a better time estimate.
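
One way to check afterwards how long a finished job actually ran is Slurm's accounting command sacct. A minimal sketch (the job id 1234567 is just a placeholder):

$ sacct -j 1234567 --format=JobID,JobName,Elapsed,MaxRSS,State

Elapsed is the wall time the job ran and MaxRSS its peak memory use; both can be fed back into better --time and --mem estimates.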

1.1. Slurm partitions

A partition in the Slurm system is a set of computing resources that are bundled together and can be accessed separately, i.e. by sending your jobs to the queue of a particular partition. Mahuika (the NeSI cluster you are likely going to use) has several different partitions configured, see here.

Depending on your jobs' requirements listed above (time, CPUs, memory), you need to submit them to the correct Slurm partition queue.

For example, the main partition on Mahuika is called “large”. It has many computing nodes (226) with many CPUs each (72), but each node only has 108GB of memory available. As one of your jobs will likely run on a single node, you could allocate a maximum of 108GB of memory to that job. However, on “large” you should not aim to use more than 1.5GB per core that your job uses: if you ran a job with 10 cores, you should not allocate more than 15GB to it. This keeps enough memory available for the otherwise unused cores on that node, so they are not “wasted”. Another restriction of the “large” partition is that a job may run for at most 3 days. So if you have many short jobs that do not need much memory, this is the partition to use. If, however, a job needs, let's say, 50GB of memory while using only a few cores, or would run for more than 3 days, this is the wrong partition.
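
As a sketch, the 1.5GB-per-core rule can also be expressed directly with Slurm's --mem-per-cpu option instead of a fixed total --mem; the values below just restate the 10-core example:

#SBATCH --ntasks=10             # 10 tasks, one CPU each
#SBATCH --mem-per-cpu=1500M     # 1.5GB per core, i.e. 15GB in total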

Check the different partitions at the link above.
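
You can also list the partitions directly on the cluster with Slurm's sinfo command, for example in its summarized form:

$ sinfo -s    # one summary line per partition (state, node counts, time limit)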

We can see the jobs currently submitted to the queue of a particular partition (e.g. “large”) with the command squeue -p large. To only see the jobs of a particular user, you can type squeue -p large -u username.
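
For example, to check only your own jobs in that queue, you can let the shell fill in your username via the $USER variable:

$ squeue -p large -u $USER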

1.2. Submitting a job with sbatch

To submit a job to the cluster, one generally writes a job script and submits it via the Slurm program sbatch (see here).

A job script is a relatively simple affair; for example, the following script (e.g. test.sh) would submit a python script (test.py) to the partition “large”:

Listing 1.1 : Simple bash script (test.sh) for slurm cluster submission.
#!/bin/bash
#SBATCH --account=yourProjectId       # The project id given to you by NeSI
#SBATCH --partition=large             # The partition we want to use
#SBATCH --ntasks=1                    # Run a single task on a single CPU
#SBATCH --mem=1g                      # Job memory request
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=logs/%x-%j.out       # Standard output log
#SBATCH --error=logs/%x-%j.err        # Standard error log

module load Miniconda3/4.4.10
echo "Running python script on a single CPU"
python test.py
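
Note that Slurm does not create the logs/ directory used by --output and --error for you. If it does not exist, no log files will be written (and, depending on the Slurm version, the job may fail outright), so create it once before submitting:

$ mkdir -p logs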

This script could be submitted to the queue of partition “large” with the command: sbatch test.sh.
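
On success, sbatch prints the id of the newly queued job, which you can use to monitor or cancel it. A quick sketch (the job id is just a placeholder):

$ sbatch test.sh
Submitted batch job 1234567
$ squeue -j 1234567     # check the job's state in the queue
$ scancel 1234567       # cancel the job if something went wrong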

The only special thing about this bash script are the lines that start with #SBATCH. They tell sbatch to use the specified parameters, so that we do not have to pass them on the command line like this:

$ sbatch --account=yourProjectId \
         --partition=large \
         --ntasks=1 \
         --mem=1g \
         --time=00:05:00 \
         --output=logs/%x-%j.out \
         --error=logs/%x-%j.err \
         test.sh

Note

We aim to use Snakemake for job submission, so we are not going to submit our jobs with any of the methods above. However, we still need to understand the sbatch parameters as well as the different partitions, because we will have to specify them for Snakemake too.
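
As a preview, here is a minimal sketch of how these sbatch parameters could be passed through Snakemake, assuming a Snakemake version that still provides the --cluster option (the exact invocation we will use comes later; the account id and resource values are the placeholders from above):

$ snakemake --jobs 10 \
            --cluster "sbatch --account=yourProjectId --partition=large \
            --ntasks=1 --mem=1g --time=00:05:00 \
            --output=logs/%x-%j.out --error=logs/%x-%j.err"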