3. Snakemake setup¶
The example below will make use of a workflow and dataset I prepared for teaching purposes. It can be accessed at https://gitlab.com/schmeierlab/reproduce-tutorial.
The snakemake command should now be available.
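As a quick check that the command really is on your PATH (the reported version will depend on how you installed Snakemake):
$ snakemake --version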
3.1. Getting your workflow and data onto NeSI¶
You could, of course, just scp or rsync your workflow over to the Mahuika cluster.
However, I highly recommend that you use Git and a remote provider like GitLab or GitHub for this purpose.
You should version control your workflow in any case, so pushing it to a remote on top of that is not much extra effort.
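If your workflow so far only lives on your local machine, a minimal sketch of putting it under version control and pushing it to a remote might look like this (assuming you have already created an empty repository on GitLab or GitHub; the remote URL below is a placeholder):
# initialise a repository and commit the workflow files
$ git init
$ git add .
$ git commit -m "Initial version of the workflow"
# connect it to the (hypothetical) remote and push
# note: the default branch may be called main instead of master on your setup
$ git remote add origin [email protected]:yourname/yourworkflow.git
$ git push -u origin master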
Once your workflow is safely stored at a remote location, downloading it to NeSI is as easy as typing the following command (adjust the URL to point to your workflow):
$ git clone https://gitlab.com/schmeierlab/reproduce-tutorial.git
# or if ssh has been set up
$ git clone [email protected]:schmeierlab/reproduce-tutorial.git
This workflow comes with some example data, so there is no need to upload or download data for it separately.
However, for your own data you might want to use scp to copy it to NeSI from your local computer.
If it is a lot of data, you might want to use the non-backed-up (nobackup) section of your project for this purpose.
This can be done easily with scp from your local machine, e.g. a variation of:
$ scp -r data [email protected]:/nesi/nobackup/yourprojectcode/
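For larger transfers, rsync is an alternative worth considering, since it can resume an interrupted copy and skips files that are already up to date. A sketch using the same hypothetical paths as above:
# -a keeps permissions/timestamps, -v is verbose, -P shows progress and allows resuming
$ rsync -avP data [email protected]:/nesi/nobackup/yourprojectcode/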
3.2. An example workflow¶
3.2.1. Download the workflow¶
Here, we are going to use the example workflow mentioned in the last section to illustrate how to set up your workflow for running it on NeSI. You do not have to use this example workflow but can adjust your workflow based on the example below. Move to your project’s persistent folder and download it, if you have not already, with:
# change to folder
$ cd /nesi/project/yourprojectcode/
$ git clone https://gitlab.com/schmeierlab/reproduce-tutorial.git
# or if ssh has been set up
$ git clone [email protected]:schmeierlab/reproduce-tutorial.git
Note
If cloning with Git does not work, you can download a zipped archive of the whole repository from https://gitlab.com/schmeierlab/reproduce-tutorial/-/archive/master/reproduce-tutorial-master.zip. To unpack it locally on the command line, you can type:
$ unzip reproduce-tutorial-master.zip; mv reproduce-tutorial-master reproduce-tutorial
3.2.2. Inspect and set up the workflow¶
Let us inspect what we just downloaded:
# After successful download we can inspect it with
$ cd reproduce-tutorial
$ ls -1F
data/
envs/
examples/
fastq/
help/
LICENSE
logs/
README.md
Snakefile
This is data from an RNA-seq experiment.
We want to map the single-end fastq files in the folder fastq to the yeast reference genome in the folder data.
The Snakefile is empty, as it is being developed throughout the Reproducibility tutorial.
We replace this file with the final version of the workflow like this:
$ rm Snakefile
$ cp examples/Snakefile_v7 Snakefile
The workflow consists of four rules that will trim the data, build a genome index, map the fastq files to the genome and count reads per feature in the genome.
The workflow makes use of a few programs that can be installed on the fly by Snakemake when using the --use-conda parameter.
However, I found that large workflows that use many environments create a great number of files in the local .snakemake directory, and you may soon reach your file-number limit on the cluster.
Thus, here we are going to use one environment for the whole workflow, which we create upfront with:
# create env
$ conda env create -n tutorial -f data/nesi/cluster-nesi-condaenv.yaml
# activate env
$ conda activate tutorial
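Before involving the cluster, it is worth checking that the workflow and the new environment fit together. A minimal sanity check, assuming the Snakefile parses as-is and the environment provides the snakemake command (a dry run only prints the planned jobs and shell commands, nothing is executed):
# confirm snakemake is picked up from the freshly activated environment
$ which snakemake
# dry run: list the jobs and shell commands Snakemake would run
$ snakemake --dry-run --printshellcmds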
3.2.3. Slurm and NeSI specific setup¶
To have Snakemake distribute our jobs to the cluster, Snakemake automatically creates one job script per job it needs to run and submits it to the cluster.
However, in order to do so, Snakemake needs some information about the sbatch parameters it should use.
We are going to do two things to run the workflow on the cluster:
- Set up a config file that specifies sbatch parameters per rule
- Use special Snakemake parameters to make use of these sbatch parameters
Let's look at the second part first. The command structure for Snakemake will look like this:
$ snakemake --jobs 999 \
            --printshellcmds \
            --rerun-incomplete \
            --cluster-config data/nesi/cluster-nesi-mahuika.yaml \
            --cluster "sbatch --account={cluster.account} \
                              --partition={cluster.partition} \
                              --mem={cluster.mem} \
                              --ntasks={cluster.ntasks} \
                              --cpus-per-task={cluster.cpus-per-task} \
                              --time={cluster.time} \
                              --hint={cluster.hint} \
                              --output={cluster.output} \
                              --error={cluster.error}"
Attention
If you want a per-rule environment for each run, you can specify the parameter --use-conda. However, for the reason mentioned above we are not using it here.
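Because the full command is long, it can be convenient to keep it in a small wrapper script inside the repository, so it stays version controlled and is easy to re-run. A sketch; the file name run_cluster.sh is just a suggestion, and extra arguments (e.g. --dry-run) are passed straight through to Snakemake:
#!/bin/bash
# run_cluster.sh - wrapper around the cluster submission command shown above
# usage: bash run_cluster.sh [additional snakemake arguments]
snakemake --jobs 999 \
    --printshellcmds \
    --rerun-incomplete \
    --cluster-config data/nesi/cluster-nesi-mahuika.yaml \
    --cluster "sbatch --account={cluster.account} --partition={cluster.partition} \
               --mem={cluster.mem} --ntasks={cluster.ntasks} \
               --cpus-per-task={cluster.cpus-per-task} --time={cluster.time} \
               --hint={cluster.hint} --output={cluster.output} \
               --error={cluster.error}" \
    "$@"
For example, bash run_cluster.sh --dry-run shows what would be submitted without starting any jobs.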
There are two parameters in the command that are cluster-specific; let's have a look:
- --cluster: This parameter specifies the sbatch submit command that Snakemake will use. It contains some wildcards (like the ones used within the Snakefile). These wildcards will be replaced during job submission with rule-specific values. The values per rule are specified in the cluster config file: data/nesi/cluster-nesi-mahuika.yaml in the above example.
- --cluster-config: This file contains the parameters for sbatch on a per-rule basis.
Let us have a look at the cluster config file; here is the first entry:
__default__:
    account: yourprojectcode
    time: 00:10:00
    ntasks: 1
    cpus-per-task: 1
    mem: 1500m
    partition: large
    hint: nomultithread
    output: logs/%x-%j.out
    error: logs/%x-%j.err
The first entry (__default__) shown here is the default that Snakemake will use for all rules, except where parameters are overwritten in rule-specific entries in this file.
For example, the third rule in the Snakefile, called map, maps reads to the genome using bwa mem.
Here, bwa mem can make use of multiple cores at the same time to speed things up, e.g. we could decide to use 8 cores.
Thus, we could specify multithreading, and on the large partition we could allocate up to 8 * 1.5 GB = 12 GB of memory to the job.
If we want more memory, we would need to move to another partition, e.g. bigmem.
For example purposes, I want the job to run with 20 GB on the bigmem partition.
In the cluster config file we therefore add an entry for the rule and overwrite the parameters for multithreading (line 16) and for the bigmem partition (line 14).
The entry only needs to specify the parameters we want to change from the __default__ entry; e.g. we also specify that this rule is allowed to use more time, here 20 minutes (line 13), than the default of 10 minutes.
Finally, we specify the memory (line 17) and the number of cores the job should be using (line 15).
 1 | __default__:
 2 |     account: yourprojectcode
 3 |     time: 00:10:00
 4 |     ntasks: 1
 5 |     cpus-per-task: 1
 6 |     mem: 1500m
 7 |     partition: large
 8 |     hint: nomultithread
 9 |     output: logs/%x-%j.out
10 |     error: logs/%x-%j.err
11 |
12 | map:
13 |     time: 00:20:00
14 |     partition: bigmem
15 |     cpus-per-task: 8
16 |     hint: multithread
17 |     mem: 20g
Similarly, we add two more entries for the rules makeidx and featurecount (lines 19 and 22).
The featurecount rule can make use of multiple cores as well, so we add parameters here too to change this.
However, we want these jobs to be sent to the large partition, so we keep the default (which we do not have to specify explicitly again).
 1 | __default__:
 2 |     account: yourprojectcode
 3 |     time: 00:10:00
 4 |     ntasks: 1
 5 |     cpus-per-task: 1
 6 |     mem: 1500m
 7 |     partition: large
 8 |     hint: nomultithread
 9 |     output: logs/%x-%j.out
10 |     error: logs/%x-%j.err
11 |
12 | map:
13 |     time: 00:20:00
14 |     partition: bigmem
15 |     cpus-per-task: 8
16 |     hint: multithread
17 |     mem: 20g
18 |
19 | makeidx:
20 |     time: 00:20:00
21 |
22 | featurecount:
23 |     time: 00:20:00
24 |     cpus-per-task: 4
25 |     hint: multithread
26 |     mem: 6g
The __default__ entry specifies that we do not allow multi-threading (hint: nomultithread).
We change this for e.g. rule map to hint: multithread and add the cpus-per-task parameter, hence allowing up to 8 CPUs to be reserved on the "bigmem" partition by jobs that use this rule.
Just remember that you still need to specify threads: 8 in the Snakefile for the rule map, and make the threads available to bwa mem in the shell command (see lines 53 and 59 below).
43 | rule map:
44 |     input:
45 |         reads = "analyses/results/{sample}.trimmed.fastq.gz",
46 |         idxdone = "data/makeidx.done"
47 |     output:
48 |         "analyses/results/{sample}.bam"
49 |     log:
50 |         "analyses/logs/{sample}.map"
51 |     benchmark:
52 |         "analyses/benchmarks/{sample}.map"
53 |     threads: 8
54 |     conda:
55 |         "envs/map.yaml"
56 |     params:
57 |         idx = "data/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa"
58 |     shell:
59 |         "bwa mem -t {threads} {params.idx} {input.reads} | "
60 |         "samtools view -Sbh > {output} 2> {log}"
Finally, when using the above command, Snakemake will create a job script for each job derived from the rules and requested final targets in the Snakefile and submit them via sbatch with the correct configuration to the cluster.
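To make the wildcard substitution concrete, this is roughly what the submit command for a job of rule map would look like once the {cluster.*} placeholders have been filled in from the config file above (the job-script path at the end is generated by Snakemake and is shown here only as an illustration):
$ sbatch --account=yourprojectcode --partition=bigmem --mem=20g \
         --ntasks=1 --cpus-per-task=8 --time=00:20:00 --hint=multithread \
         --output=logs/%x-%j.out --error=logs/%x-%j.err \
         .snakemake/tmp.abc123/snakejob.map.1.sh  # illustrative path only
Once jobs have been submitted, squeue -u $USER shows their state in the queue, and each job's stdout and stderr end up in the logs/ directory as configured by the output and error entries.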