3. Snakemake setup

The example below will make use of a workflow and dataset I prepared for teaching purposes. It can be accessed at https://gitlab.com/schmeierlab/reproduce-tutorial.

At this point, the snakemake command should be available.
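
You can quickly verify this (both commands only print version information):

# quick sanity check that Snakemake (and conda, which we need later) can be found
$ snakemake --version
$ conda --version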

3.1. Getting your workflow and data onto NeSI

You could of course just scp or rsync your workflow over to the Mahuika cluster. However, I highly recommend using Git together with a remote provider like GitLab or GitHub for this purpose. You should version control your workflow in any case, so pushing it to a remote is little extra effort.
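
If your workflow is not under version control yet, the general pattern looks roughly like this (the remote URL is a placeholder for an empty repository you create on GitLab or GitHub first):

# run on your local machine inside the workflow directory
$ git init
$ git add Snakefile envs/ README.md
$ git commit -m "initial version of the workflow"

# placeholder URL; point it at your own remote repository
$ git remote add origin [email protected]:yourname/your-workflow.git
$ git push -u origin master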

Once your workflow is stored on a remote, downloading it to NeSI is as easy as typing the following command (adjust the URL to point to your workflow):

$ git clone https://gitlab.com/schmeierlab/reproduce-tutorial.git

# or if ssh has been set up
$ git clone [email protected]:schmeierlab/reproduce-tutorial.git

This workflow comes with some example data, so there is no need to upload or download any data for it separately.

However, your own data you will likely need to copy from your local computer to NeSI, e.g. with scp. If it is a lot of data, you might want to use the non-backed-up (nobackup) part of your project storage for this purpose. From your local machine this can be done with a variation of:

$ scp -r data [email protected]:/nesi/nobackup/yourprojectcode/

Note

See the NeSI support documentation for more details on transferring data to/from NeSI via scp or Globus.
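
As an alternative to scp, rsync only transfers files that have changed and can resume interrupted transfers, which helps with large datasets. A minimal sketch (the project code is a placeholder, as above):

# -a preserves permissions/timestamps, -v is verbose, -P shows progress and keeps partial files for resuming
$ rsync -avP data [email protected]:/nesi/nobackup/yourprojectcode/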

3.2. An example workflow

3.2.1. Download the workflow

Here, we are going to use the example workflow mentioned in the last section to illustrate how to set up a workflow for running it on NeSI. You do not have to use this example workflow; you can adapt your own workflow based on the example below. Move to your project’s persistent folder and download it, if you have not done so already, with:

# change to folder
$ cd /nesi/project/yourprojectcode/

$ git clone https://gitlab.com/schmeierlab/reproduce-tutorial.git

# or if ssh has been set up
$ git clone [email protected]:schmeierlab/reproduce-tutorial.git

Note

If cloning with Git does not work, you can download a zipped archive of the whole repository from https://gitlab.com/schmeierlab/reproduce-tutorial/-/archive/master/reproduce-tutorial-master.zip. Then, on the command line, you can type unzip reproduce-tutorial-master.zip; mv reproduce-tutorial-master reproduce-tutorial.
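
Spelled out as commands (assuming wget and unzip are available on the cluster, which is usually the case):

# download and unpack the archive, then rename the folder to match the tutorial
$ wget https://gitlab.com/schmeierlab/reproduce-tutorial/-/archive/master/reproduce-tutorial-master.zip
$ unzip reproduce-tutorial-master.zip
$ mv reproduce-tutorial-master reproduce-tutorial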

3.2.2. Inspect and setup the workflow

Let us inspect what we just downloaded:

# After successful download we can inspect it with
$ cd reproduce-tutorial
$ ls -1F
data/
envs/
examples/
fastq/
help/
LICENSE
logs/
README.md
Snakefile

This is data from an RNA-seq experiment. We want to map the single-end fastq-files in the folder fastq to the yeast reference genome in the folder data. The Snakefile is empty, as it is developed step by step throughout the Reproducibility tutorial. We replace this file with the final version of the workflow like this:

$ rm Snakefile
$ cp examples/Snakefile_v7 Snakefile

The workflow consists of four rules that will trim the data, build a genome index, map the fastq-files to the genome, and count reads per feature in the genome.

The workflow makes use of a few programs that Snakemake can install on the fly when run with the --use-conda parameter. However, I found that for large workflows that use many environments, this creates a very large number of files in the local .snakemake directory, and you may quickly reach your file-count limit on the cluster. Thus, here we are going to use a single environment for the whole workflow, which we create upfront with:

# create env
$ conda env create -n tutorial -f data/nesi/cluster-nesi-condaenv.yaml

# activate env
$ conda activate tutorial
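
Before touching the cluster, it is worth checking that the workflow parses and seeing which jobs Snakemake would run. A dry run does that without executing anything:

# -n (--dry-run) only reports what would be done, -p prints the shell commands
$ snakemake -n -p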

3.2.3. Slurm and NeSI specific setup

To distribute our jobs to the cluster, Snakemake will automatically create one job script per job it needs to run and submit it to the cluster. However, to do so, Snakemake needs some information about the sbatch parameters it should use. We are going to do two things to run the workflow on the cluster:

  • Set up a config-file that specifies sbatch parameters per rule
  • Use special Snakemake parameters to make use of these sbatch parameters

Let’s look at the second part first. The Snakemake command will look like this:

$ snakemake --jobs 999 \
            --printshellcmds \
            --rerun-incomplete \
            --cluster-config data/nesi/cluster-nesi-mahuika.yaml \
            --cluster "sbatch --account={cluster.account} \
                              --partition={cluster.partition} \
                              --mem={cluster.mem} \
                              --ntasks={cluster.ntasks} \
                              --cpus-per-task={cluster.cpus-per-task} \
                              --time={cluster.time} \
                              --hint={cluster.hint} \
                              --output={cluster.output} \
                              --error={cluster.error}"

Attention

If you want per-rule environments, you can add the --use-conda parameter. However, for the reason mentioned above we are not using it here.
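
Since the full command is long, one option (a suggestion, not part of the original tutorial) is to keep it in a small shell script next to the Snakefile, so it is version controlled and easy to re-run. A sketch, assuming you name it run_cluster.sh:

#!/usr/bin/env bash
# hypothetical helper script (run_cluster.sh) wrapping the submission command above
set -euo pipefail

snakemake --jobs 999 \
          --printshellcmds \
          --rerun-incomplete \
          --cluster-config data/nesi/cluster-nesi-mahuika.yaml \
          --cluster "sbatch --account={cluster.account} --partition={cluster.partition} --mem={cluster.mem} --ntasks={cluster.ntasks} --cpus-per-task={cluster.cpus-per-task} --time={cluster.time} --hint={cluster.hint} --output={cluster.output} --error={cluster.error}"

You can then start the workflow from the workflow directory with bash run_cluster.sh, ideally inside a tmux or screen session (if available) so it keeps running after you log out.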

There are two parameters in the command that are cluster-specific; let’s have a look at them:

  • --cluster: This parameter specifies the sbatch submit command that Snakemake will use. It contains some wildcards (like the ones used within the Snakefile). These wildcards will be replaced during job submission with rule-specific values. The values per rule will be specified in the cluster config file: data/nesi/cluster-nesi-mahuika.yaml in the above example.
  • --cluster-config: This file contains the parameters for sbatch on a per rule basis.

Let us have a look at the cluster config-file, first entry:

Listing 3.1 : A cluster file for the NeSI Mahuika cluster: data/nesi/cluster-nesi-mahuika.yaml
__default__:
   account: yourprojectcode
   time: 00:10:00
   ntasks: 1
   cpus-per-task: 1
   mem: 1500m
   partition: large
   hint: nomultithread
   output: logs/%x-%j.out
   error: logs/%x-%j.err

The first entry (__default__) shown here is the default that Snakemake will use for all rules, unless parameters are overridden in rule-specific entries in this file.
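
To make the connection between the --cluster template and this config-file concrete: for a job from a rule without a specific entry, the __default__ values are substituted for the {cluster.*} wildcards, so the submit command Snakemake builds looks roughly like this (the job-script path at the end is generated by Snakemake):

# approximate expansion of the --cluster template using the __default__ values
$ sbatch --account=yourprojectcode --partition=large --mem=1500m \
         --ntasks=1 --cpus-per-task=1 --time=00:10:00 --hint=nomultithread \
         --output=logs/%x-%j.out --error=logs/%x-%j.err <snakemake job script>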

For example, the third rule in the Snakefile, called map, maps reads to the genome using bwa mem. bwa mem can make use of multiple cores at the same time to speed things up, e.g. we could decide to use 8 cores. Thus, we could specify multithreading, and on the large partition we could allocate up to 8 * 1.5GB = 12GB of memory to the job. If we want more memory, we would need to move to another partition, e.g. bigmem.

For example purposes, I want the job to run with 20GB on the bigmem partition. In the cluster config-file we add an entry for the rule and override the multithreading option (line 16) and the partition, now bigmem (line 14). The entry only needs to list the parameters we want to change from the __default__ entry; e.g. we also allow this rule more time, here 20 minutes (line 13), than the default of 10 minutes. Finally, we specify the memory (line 17) and the number of cores the job should use (line 15).

Listing 3.2 : A cluster file for the NeSI Mahuika cluster: data/nesi/cluster-nesi-mahuika.yaml
 1  __default__:
 2     account: yourprojectcode
 3     time: 00:10:00
 4     ntasks: 1
 5     cpus-per-task: 1
 6     mem: 1500m
 7     partition: large
 8     hint: nomultithread
 9     output: logs/%x-%j.out
10     error: logs/%x-%j.err
11
12  map:
13     time: 00:20:00
14     partition: bigmem
15     cpus-per-task: 8
16     hint: multithread
17     mem: 20g

Similarly, we add two more entries for the rules makeidx and featurecount (lines 19 and 22). The featurecount rule can make use of multiple cores as well, so we add the corresponding parameters for it too. However, we want both jobs to be sent to the large partition, so we keep the default (which we do not have to specify explicitly again).

Listing 3.3 : A cluster file for the NeSI Mahuika cluster: data/nesi/cluster-nesi-mahuika.yaml
 1  __default__:
 2     account: yourprojectcode
 3     time: 00:10:00
 4     ntasks: 1
 5     cpus-per-task: 1
 6     mem: 1500m
 7     partition: large
 8     hint: nomultithread
 9     output: logs/%x-%j.out
10     error: logs/%x-%j.err
11
12  map:
13     time: 00:20:00
14     partition: bigmem
15     cpus-per-task: 8
16     hint: multithread
17     mem: 20g
18
19  makeidx:
20     time: 00:20:00
21
22  featurecount:
23     time: 00:20:00
24     cpus-per-task: 4
25     hint: multithread
26     mem: 6g

The __default__ entry specifies that we do not allow multi-threading (hint: nomultithread). We change this for, e.g., rule map to hint: multithread and add the cpus-per-task parameter, hence allowing up to 8 CPUs to be reserved on the bigmem partition by jobs of this rule. Just remember that you still need to specify threads: 8 in the Snakefile for the rule map, and make the threads available to bwa mem in the shell command (see lines 53 and 59 below).

Listing 3.4 : Excerpt from the Snakefile, showing the bwa multi-threading setup.
43  rule map:
44      input:
45          reads = "analyses/results/{sample}.trimmed.fastq.gz",
46          idxdone = "data/makeidx.done"
47      output:
48          "analyses/results/{sample}.bam"
49      log:
50          "analyses/logs/{sample}.map"
51      benchmark:
52          "analyses/benchmarks/{sample}.map"
53      threads: 8
54      conda:
55          "envs/map.yaml"
56      params:
57          idx = "data/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa"
58      shell:
59          "bwa mem -t {threads} {params.idx} {input.reads} | "
60          "samtools view -Sbh > {output} 2> {log}"

Finally, when using the above command, Snakemake will create a job script for each job derived from the rules and the requested final targets in the Snakefile, and submit it via sbatch to the cluster with the correct configuration.
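
Once the jobs have been submitted, you can keep an eye on them with the standard Slurm tools; the per-job Slurm log files end up in the logs/ directory configured via output: and error: in the cluster config-file:

# show your currently queued and running jobs
$ squeue --user $USER

# after a job has finished, inspect its resource usage (replace the job ID)
$ sacct -j <jobid> --format=JobName,Elapsed,MaxRSS,State

# Slurm stdout/stderr per job
$ ls logs/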