Nextflow Training
Table of Contents
1. Environment setup
1.1. Requirements
1.2. Nextflow installation
1.3. Training material
2. Get started with Nextflow
2.1. Basic concepts
2.1.1. Processes and channels
2.1.2. Execution abstraction
2.1.3. Scripting language
2.2. Your first script
2.3. Modify and resume
2.4. Pipeline parameters
3. Channels
3.1. Channel types
3.1.1. Queue channel
3.1.2. Value channels
3.2. Channel factories
3.2.1. value
3.2.2. from
3.2.3. of
3.2.4. fromList
3.2.5. fromPath
3.2.6. fromFilePairs
3.2.7. fromSRA
4. Processes
4.1. Script
4.1.1. Script parameters
4.1.2. Conditional script
4.2. Inputs
4.2.1. Input values
4.2.2. Input files
4.2.3. Input path
4.2.4. Combine input channels
4.2.5. Input repeaters
4.3. Outputs
4.3.1. Output values
4.3.2. Output files
4.3.3. Multiple output files
4.3.4. Dynamic output file names
4.3.5. Composite inputs and outputs
4.4. When
4.5. Directives
4.5.1. Exercise
https://seqera.io/training/#_channels 1/71
10/9/2020 Nextflow training
7.5.1. Exercise
7.5.2. Recap
7.6. MultiQC report
7.6.1. Recap
7.7. Handle completion event
7.8. Bonus!
7.9. Custom scripts
7.9.1. Recap
7.10. Metrics and reports
7.11. Run a project from GitHub
7.12. More resources
8. Manage dependencies & containers
8.1. Docker hands-on
8.1.1. Run a container
8.1.2. Pull a container
8.1.3. Run a container in interactive mode
8.1.4. Your first Dockerfile
8.1.5. Build the image
8.1.6. Add a software package to the image
8.1.7. Run Salmon in the container
8.1.8. File system mounts
8.1.9. Upload the container in the Docker Hub (bonus)
8.1.10. Run a Nextflow script using a Docker container
8.2. Singularity
8.2.1. Create a Singularity image
8.2.2. Running a container
8.2.3. Import a Docker image
8.2.4. Run a Nextflow script using a Singularity container
8.2.5. The Singularity Container Library
8.3. Conda/Bioconda packages
8.3.1. Bonus Exercise
8.4. BioContainers
8.5. More resources
9. Nextflow configuration
9.1. Configuration file
9.1.1. Config syntax
9.1.2. Config variables
9.1.3. Config comments
9.1.4. Config scopes
9.1.5. Config params
9.1.6. Config env
9.1.7. Config process
9.1.8. Config Docker execution
9.1.9. Config Singularity execution
9.1.10. Config Conda execution
10. Deployment scenarios
10.1. Cluster deployment
10.2. Managing cluster resources
1. Environment setup
1.1. Requirements
Nextflow can be used on any POSIX-compatible system (Linux, OS X, etc.). It requires Bash and Java 8 (or later, up to 12) (http://www.oracle.com/technetwork/java/javase/downloads/index.html) to be installed.
1.2. Nextflow installation

Install Nextflow by entering the following commands in a terminal; the second command moves the executable to a directory in your PATH:

BASH
curl get.nextflow.io | bash
mv nextflow ~/bin
Check that Nextflow was installed correctly by running:

BASH
nextflow info
1.3. Training material

Download the training material by running:

BASH
aws s3 sync s3://seqeralabs.com/public/nf-training .
2. Get started with Nextflow

2.1. Basic concepts

Nextflow is designed around the idea that the Linux platform is the lingua franca of data science. Linux provides many simple but powerful command-line and scripting tools that, when chained together, facilitate complex data manipulations.

Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel computational environment based on the dataflow programming model. Its core features are workflow portability and reproducibility, simplified parallelization and large-scale deployment, and easy integration of existing tools, systems, and industry standards.

2.1.1. Processes and channels

Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state. The only way they can communicate is via asynchronous FIFO queues, called channels in Nextflow.

Any process can define one or more channels as input and output. The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.
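As a minimal sketch (the process names and commands are illustrative), two processes connected by a channel can be written as follows in the pre-DSL2 syntax used throughout this training; the into/from declarations alone are what make the second process run after the first:

```nextflow
// Producer: writes a file and emits it on the 'greetings' channel
process writeGreeting {
    output:
    file 'greeting.txt' into greetings

    """
    echo 'Hello' > greeting.txt
    """
}

// Consumer: receives each file emitted on 'greetings'
process shoutGreeting {
    input:
    file x from greetings

    output:
    stdout into results

    """
    tr '[a-z]' '[A-Z]' < $x
    """
}

results.view()
```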
2.1.2. Execution abstraction

If not otherwise specified, processes are executed on the local computer. The local executor is very useful for pipeline development and testing purposes, but for real-world computational pipelines an HPC or cloud platform is often required.
In other words, Nextflow provides an abstraction between the pipeline’s functional logic and the underlying execution
system. Thus it is possible to write a pipeline once and to seamlessly run it on your computer, a grid platform, or the
cloud, without modifying it, by simply defining the target execution platform in the configuration file.
It provides out-of-the-box support for major batch schedulers and cloud platforms:
Linux SLURM
PBS Works
Torque
Moab
HTCondor
Amazon Batch
Kubernetes
2.1.3. Scripting language

Nextflow implements a declarative domain-specific language (DSL) that simplifies the writing of complex data analysis workflows as an extension of a general-purpose programming language.

This approach makes Nextflow very flexible: it provides, in the same computing environment, the benefit of a concise DSL for handling recurrent use cases with ease, together with the flexibility and power of a general-purpose programming language for corner cases that may be difficult to implement using a declarative approach.

In practical terms, Nextflow scripting is an extension of the Groovy programming language (https://groovy-lang.org/), which in turn is a super-set of the Java programming language. Groovy can be considered "Python for Java", in that it simplifies the writing of code and is more approachable.
2.2. Your first script

Copy the following example into a file named hello.nf:

NEXTFLOW
#!/usr/bin/env nextflow

params.greeting = 'Hello world!'
greeting_ch = Channel.from(params.greeting)

process splitLetters {

    input:
    val x from greeting_ch

    output:
    file 'chunk_*' into letters

    """
    printf '$x' | split -b 6 - chunk_
    """
}

process convertToUpper {

    input:
    file y from letters.flatten()

    output:
    stdout into result

    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

result.view{ it.trim() }
This script defines two processes. The first splits a string into files containing chunks of 6 characters. The second
receives these files and transforms their contents to uppercase letters. The resulting strings are emitted on the result
channel and the final output is printed by the view operator.
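The two commands at the heart of these processes can be tried directly in a terminal, outside Nextflow, to see what each task actually does:

```shell
# Split the 12-character string into 6-byte files chunk_aa and chunk_ab,
# exactly as the splitLetters task does
printf 'Hello world!' | split -b 6 - chunk_

# Uppercase one chunk, as the convertToUpper task does
tr '[a-z]' '[A-Z]' < chunk_aa
```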
Execute the script by entering the following command in your terminal:

CMD
nextflow run hello.nf
CMD
N E X T F L O W ~ version 20.01.0
Launching `hello.nf` [marvelous_plateau] - revision: 63f8ad7155
[warm up] executor > local
executor > local (3)
[19/c2f873] process > splitLetters [100%] 1 of 1 ✔
[05/5ff9f6] process > convertToUpper [100%] 2 of 2 ✔
HELLO
WORLD!
You can see that the first process is executed once, and the second twice. Finally, the resulting strings are printed.

It’s worth noting that the process convertToUpper is executed in parallel, so there’s no guarantee that the instance processing the first split (the chunk Hello) will be executed before the one processing the second split (the chunk world!).
Thus, it is perfectly possible that you will get the final result printed out in a different order:
WORLD!
HELLO
The hexadecimal numbers, like 22/7548fa, identify the unique process execution. These numbers are also the prefix of the directories where each process is executed. You can inspect the files produced by a process by changing to the directory $PWD/work and using these numbers to find the process-specific execution path.
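For example, assuming a task ran under the (hypothetical) ID 19/c2f873, its execution directory could be inspected like this; tab-completion expands the abbreviated hash to the full directory name:

```console
$ cd work/19/c2f873*
$ ls -la
```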
2.3. Modify and resume

Nextflow keeps track of all the processes executed in your pipeline. If you modify part of your script, only the processes that actually changed are re-executed; for the others, the cached results are retrieved. This helps a lot when testing or modifying part of your pipeline without having to re-execute it from scratch.

For the sake of this tutorial, modify the convertToUpper process in the previous example, replacing the process script with the string rev $y, so that the process looks like this:
NEXTFLOW
process convertToUpper {

    input:
    file y from letters.flatten()

    output:
    stdout into result

    """
    rev $y
    """
}
Then save the file with the same name, and execute it by adding the -resume option to the command line:

CMD
nextflow run hello.nf -resume
N E X T F L O W ~ version 20.01.0
Launching `hello.nf` [naughty_tuckerman] - revision: 22eaa07be4
[warm up] executor > local
executor > local (2)
[19/c2f873] process > splitLetters [100%] 1 of 1, cached: 1 ✔
[a7/a410d3] process > convertToUpper [100%] 2 of 2 ✔
olleH
!dlrow
You will see that the execution of the process splitLetters is actually skipped (the process ID is the same), and its results are retrieved from the cache. The second process is executed as expected, printing the reversed strings.

The pipeline results are cached by default in the directory $PWD/work. Depending on your script, this folder can take up a lot of disk space. If you are sure you won’t resume your pipeline execution, clean this folder periodically.
2.4. Pipeline parameters

Pipeline parameters are declared by prepending the prefix params to a variable name, separated by a dot character. Their value can be specified on the command line by prefixing the parameter name with a double dash, i.e. --paramName.

For the sake of this tutorial, you can try to execute the previous example specifying a different input string parameter, as shown below:

CMD
nextflow run hello.nf --greeting 'Bonjour le monde!'

The string specified on the command line will override the default value of the parameter. The output will look like this:
N E X T F L O W ~ version 20.01.0
Launching `hello.nf` [wise_stallman] - revision: 22eaa07be4
[warm up] executor > local
executor > local (4)
[48/e8315b] process > splitLetters [100%] 1 of 1 ✔
[01/840ca7] process > convertToUpper [100%] 3 of 3 ✔
uojnoB
m el r
!edno
3. Channels
Channels are a key data structure of Nextflow that allows the implementation of reactive-functional oriented computational workflows based on the Dataflow (https://en.wikipedia.org/wiki/Dataflow_programming) programming paradigm.

They are used to logically connect tasks to each other or to implement functional-style data transformations.

3.1. Channel types

Nextflow distinguishes two different kinds of channels: queue channels and value channels.
3.1.1. Queue channel

A queue channel is an asynchronous unidirectional FIFO queue connecting two processes or operators. What does FIFO mean? That the data is guaranteed to be delivered in the same order as it is produced.

A queue channel is implicitly created by process output definitions or by using channel factory methods such as Channel.from (https://www.nextflow.io/docs/latest/channel.html#from) or Channel.fromPath (https://www.nextflow.io/docs/latest/channel.html#frompath).
NEXTFLOW
ch = Channel.from(1,2,3)
println(ch)    // prints the channel object itself
ch.view()      // prints the values emitted by the channel
Exercise
Try to execute this snippet; it will produce an error message.

NEXTFLOW
ch = Channel.from(1,2,3)
ch.view()
ch.view()
A queue channel can have one and only one producer and one and only one consumer.
3.1.2. Value channels

A value channel, a.k.a. singleton channel, is bound to a single value and can be read an unlimited number of times without consuming its content. For example:

NEXTFLOW
ch = Channel.value('Hello')
ch.view()
ch.view()
ch.view()
It prints:
Hello
Hello
Hello
3.2. Channel factories

3.2.1. value

The value factory method is used to create a value channel. An optional (not null) argument can be specified to bind the channel to a specific value. For example:

NEXTFLOW
ch1 = Channel.value()
ch2 = Channel.value( 'Hello there' )
ch3 = Channel.value( [1,2,3,4,5] )

The first line creates an empty channel variable, the second creates a channel holding a string, and the third creates a channel holding a list object that will be emitted as a sole value.
3.2.2. from
The factory Channel.from allows the creation of a queue channel with the values specified as argument.
NEXTFLOW
ch = Channel.from( 1, 3, 5, 7 )
ch.view{ "value: $it" }
The first line in this example creates a variable ch which holds a channel object. This channel emits the values
specified as a parameter in the from method. Thus the second line will print the following:
value: 1
value: 3
value: 5
value: 7
3.2.3. of
The method Channel.of works in a similar manner to Channel.from, though it fixes some inconsistent behaviors of the latter and provides better handling of ranges of values. For example:
NEXTFLOW
Channel
    .of(1..23, 'X', 'Y')
    .view()
3.2.4. fromList
The method Channel.fromList creates a channel emitting the elements provided by a list object specified as an argument:
NEXTFLOW
list = ['hello', 'world']

Channel
    .fromList(list)
    .view()
3.2.5. fromPath
The fromPath factory method creates a queue channel emitting one or more files matching the specified glob pattern.
NEXTFLOW
Channel.fromPath( '/data/big/*.txt' )
This example creates a channel and emits as many items as there are files with txt extension in the /data/big
folder. Each element is a file object implementing the Path (https://docs.oracle.com/javase/8/docs/api/java/nio/file/Paths.html)
interface.
Two asterisks, i.e. ** , works like * but crosses directory boundaries. This syntax is generally used
for matching complete paths. Curly brackets specify a collection of sub-patterns.
Name           Description
glob           When true, interprets characters *, ?, [] and {} as glob wildcards, otherwise handles them as normal characters (default: true)
type           Type of paths returned, either file, dir or any (default: file)
hidden         When true, includes hidden files in the resulting paths (default: false)
followLinks    When true, follows symbolic links during directory tree traversal, otherwise they are managed as files (default: true)
relative       When true, returned paths are relative to the top-most common directory (default: false)
checkIfExists  When true, throws an exception if the specified path does not exist in the file system (default: false)
Learn more about the glob patterns syntax at this link (https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob).
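For example, a sketch combining the two-stars wildcard with a brace sub-pattern (the paths here are hypothetical):

```nextflow
// Emit every .fq or .fastq file under data/, descending into any
// subdirectory via the ** pattern; fail early if nothing matches
Channel
    .fromPath( 'data/**/*.{fq,fastq}', checkIfExists: true )
    .view()
```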
Exercise
Use the Channel.fromPath method to create a channel emitting all files with the suffix .fq in the data/ggal/ directory and any subdirectory, then print the file names.
3.2.6. fromFilePairs
The fromFilePairs method creates a channel emitting the file pairs matching a glob pattern provided by the user.
The matching files are emitted as tuples in which the first element is the grouping key of the matching pair and the
second element is the list of files (sorted in lexicographical order).
NEXTFLOW
Channel
    .fromFilePairs('/my/data/SRR*_{1,2}.fastq')
    .view()
Name           Description
type           Type of paths returned, either file, dir or any (default: file)
hidden         When true, includes hidden files in the resulting paths (default: false)
followLinks    When true, follows symbolic links during directory tree traversal, otherwise they are managed as files (default: true)
size           Defines the number of files each emitted item is expected to hold (default: 2). Set to -1 for any.
flat           When true, the matching files are produced as sole elements in the emitted tuples (default: false)
checkIfExists  When true, throws an exception if the specified path does not exist in the file system (default: false)
Exercise
Use the fromFilePairs method to create a channel emitting all pairs of fastq reads in the data/ggal/ directory and print them.
Then use the flat:true option and compare the output with the previous execution.
3.2.7. fromSRA
The Channel.fromSRA method makes it possible to query the NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) archive and returns a channel emitting the FASTQ files matching the specified selection criteria.

The query can be a project ID or accession number(s) supported by the NCBI ESearch API (https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch). For example, the following snippet:
NEXTFLOW
Channel
    .fromSRA('SRP043510')
    .view()
prints:
TEXT
[SRR1448794, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/004/SRR1448794/SRR1448794.fastq.gz]
[SRR1448795, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/005/SRR1448795/SRR1448795.fastq.gz]
[SRR1448792, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/002/SRR1448792/SRR1448792.fastq.gz]
[SRR1448793, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/003/SRR1448793/SRR1448793.fastq.gz]
[SRR1910483, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR191/003/SRR1910483/SRR1910483.fastq.gz]
[SRR1910482, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR191/002/SRR1910482/SRR1910482.fastq.gz]
(remaining omitted)
Multiple accessions can be specified by providing a list of IDs:

NEXTFLOW
ids = ['ERR908507', 'ERR908506', 'ERR908505']
Channel
    .fromSRA(ids)
    .view()
TEXT
[ERR908507, [ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR908/ERR908507/ERR908507_1.fastq.gz, ftp://ftp.sra.ebi.ac.
[ERR908506, [ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR908/ERR908506/ERR908506_1.fastq.gz, ftp://ftp.sra.ebi.ac.
[ERR908505, [ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR908/ERR908505/ERR908505_1.fastq.gz, ftp://ftp.sra.ebi.ac.
It’s straightforward to use this channel as an input using the usual Nextflow syntax. For example:
NEXTFLOW
params.accession = 'SRP043510'
reads = Channel.fromSRA(params.accession)

process fastqc {
    input:
    tuple sample_id, file(reads_file) from reads

    output:
    file("fastqc_${sample_id}_logs") into fastqc_ch

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads_file}
    """
}
The code snippet above creates a channel containing 24 samples from a chromatin dynamics study and runs FASTQC
on the resulting files.
4. Processes
A process is the basic Nextflow computing primitive used to execute foreign functions, i.e. custom scripts or tools.

The process definition starts with the keyword process, followed by the process name and finally the process body delimited by curly brackets. The process body must contain a string which represents the command or, more generally, a script that is executed by it.
NEXTFLOW
process sayHello {
    """
    echo 'Hello world!'
    """
}
A process may contain five definition blocks, respectively: directives, inputs, outputs, when clause and finally the process script. The syntax is defined as follows:

NEXTFLOW
process < name > {
    [ directives ]

    input:
    < process inputs >

    output:
    < process outputs >

    when:
    < boolean expression >

    [script|shell|exec]:
    < user script to be executed >
}
4.1. Script
The script block is a string statement that defines the command that is executed by the process to carry out its task.
A process contains one and only one script block, and it must be the last statement when the process contains input
and output declarations.
The script block can be a simple string or a multi-line string. The latter simplifies the writing of non-trivial scripts composed of multiple commands spanning multiple lines. For example:
NEXTFLOW
process example {
    script:
    """
    blastp -db /data/blast -query query.fa -outfmt 6 > blast_result
    cat blast_result | head -n 10 | cut -f 2 > top_hits
    blastdbcmd -db /data/blast -entry_batch top_hits > sequences
    """
}
By default the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding Shebang (https://en.wikipedia.org/wiki/Shebang_(Unix)) declaration. For example:
NEXTFLOW
process pyStuff {
    script:
    """
    #!/usr/bin/env python

    x = 'Hello'
    y = 'world!'
    print("%s - %s" % (x, y))
    """
}
This allows the composition, in the same workflow script, of tasks written in different programming languages, which may better fit a particular job. However, for large chunks of code it is suggested to save them into separate files and invoke them from the process script.
4.1.1. Script parameters

Process script commands can be defined dynamically using Nextflow variables, e.g. pipeline parameters. For example:

NEXTFLOW
params.data = 'World'

process foo {
    script:
    """
    echo Hello $params.data
    """
}
A process script can contain any string format supported by the Groovy programming language. This
allows us to use string interpolation or multiline string as in the script above. Refer to String
interpolation for more information.
Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment
variables need to be escaped using \ character.
NEXTFLOW
process foo {
    script:
    """
    echo "The current directory is \$PWD"
    """
}
Try to modify the above script using $PWD instead of \$PWD and check the difference.
This can be tricky when the script uses many Bash variables. A possible alternative is to use a script string delimited by single-quote characters.
NEXTFLOW
process bar {
    script:
    '''
    echo $PATH | tr : '\\n'
    '''
}
However, this no longer allows the use of Nextflow variables in the command script.

Another alternative is to use a shell statement instead of script, which uses a different syntax for Nextflow variables: !{..}. This allows the use of both Nextflow and Bash variables in the same script.
NEXTFLOW
params.data = 'le monde'

process baz {
    shell:
    '''
    X='Bonjour'
    echo $X !{params.data}
    '''
}
4.1.2. Conditional script

The process script can be defined conditionally, for example using an if statement. For example:

NEXTFLOW
params.aligner = 'kallisto'

process foo {
    script:
    if( params.aligner == 'kallisto' )
        """
        kallisto --reads /some/data.fastq
        """
    else if( params.aligner == 'salmon' )
        """
        salmon --reads /some/data.fastq
        """
    else
        throw new IllegalArgumentException("Unknown aligner $params.aligner")
}
Exercise
Write a custom function that given the aligner name as parameter returns the command string to be executed. Then
use this function as the process script body.
4.2. Inputs
Nextflow processes are isolated from each other but can communicate by sending values through channels.
Inputs implicitly determine the dependencies and the parallel execution of the process. The process execution is fired each time new data is ready to be consumed from the input channel:
The input block defines which channels the process expects to receive input data from. You can only define one input block at a time, and it must contain one or more input declarations.
NEXTFLOW
input:
<input qualifier> <input name> from <source channel>
4.2.1. Input values

The val qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the specified input name, as shown in the following example:

NEXTFLOW
num = Channel.from( 1, 2, 3 )

process basicExample {
    input:
    val x from num

    """
    echo process job $x
    """
}
In the above example the process is executed three times, each time a value is received from the channel num and
used to process the script. Thus, it results in an output similar to the one shown below:
process job 3
process job 1
process job 2
The channel guarantees that items are delivered in the same order as they have been sent - but -
since the process is executed in a parallel manner, there is no guarantee that they are processed in
the same order as they are received.
4.2.2. Input files

The file qualifier allows the handling of file values in the process execution context. This means that Nextflow will stage the file in the process execution directory, and it can be accessed in the script by using the name specified in the input declaration:

NEXTFLOW
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file 'sample.fastq' from reads
    script:
    """
    your_command --reads sample.fastq
    """
}
The input file name can also be defined using a variable reference as shown below:
NEXTFLOW
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file sample from reads
    script:
    """
    your_command --reads $sample
    """
}
The same syntax can also handle more than one input file in the same execution; it only requires changing the channel composition, for example with the collect operator:
NEXTFLOW
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file sample from reads.collect()
    script:
    """
    your_command --reads $sample
    """
}
When a process declares an input file, the corresponding channel elements must be file objects, i.e. created with the file helper function or by a file-specific channel factory such as Channel.fromPath or Channel.fromFilePairs.
NEXTFLOW
params.genome = 'data/ggal/transcriptome.fa'

process foo {
    input:
    file genome from params.genome
    script:
    """
    your_command --reads $genome
    """
}
The above code creates a temporary file named input.1 with the string data/ggal/transcriptome.fa as content.
That likely is not what you wanted to do.
4.2.3. Input path

The path qualifier, unlike file, interprets a plain string value as the path of the input file:

NEXTFLOW
params.genome = "$baseDir/data/ggal/transcriptome.fa"

process foo {
    input:
    path genome from params.genome
    script:
    """
    your_command --reads $genome
    """
}
The path qualifier should be preferred over file to handle process input files when using Nextflow
19.10.0 or later.
Exercise
Write a script that creates a channel containing all read files matching the pattern data/ggal/*_1.fq followed by a
process that concatenates them into a single file and prints the first 20 lines.
4.2.4. Combine input channels

A key feature of processes is the ability to handle inputs from multiple channels. For example:

NEXTFLOW
process foo {
    echo true
    input:
    val x from Channel.from(1,2,3)
    val y from Channel.from('a','b','c')
    script:
    """
    echo $x and $y
    """
}
Both channels emit three values, therefore the process is executed three times, each time with a different pair:
(1, a)
(2, b)
(3, c)
What is happening is that the process waits until there’s a complete input configuration, i.e. it receives an input value from all the channels declared as input.

When this condition is verified, it consumes the input values coming from the respective channels, spawns a task execution, then repeats the same logic until one or more channels have no more content.

This means channel values are consumed serially, one after another, and the first empty channel causes the process execution to stop, even if there are other values in other channels.
What happens when not all channels have the same cardinality (i.e. they emit a different number of elements)?
For example:
NEXTFLOW
process foo {
    echo true
    input:
    val x from Channel.from(1,2)
    val y from Channel.from('a','b','c','d')
    script:
    """
    echo $x and $y
    """
}
In the above example the process is executed only two times, because when a channel has no more data to be processed it stops the process execution.

Note however that value channels do not affect the process termination.
To better understand this behavior compare the previous example with the following one:
NEXTFLOW
process bar {
    echo true
    input:
    val x from Channel.value(1)
    val y from Channel.from('a','b','c')
    script:
    """
    echo $x and $y
    """
}
Exercise
Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq and use the same
data/ggal/transcriptome.fa in each execution.
4.2.5. Input repeaters

The each qualifier allows you to repeat the execution of a process for each item in a collection, every time new data is received. For example:

NEXTFLOW
sequences = Channel.fromPath('data/prots/*.tfa')
methods = ['regular', 'expresso', 'psicoffee']

process alignSequences {
    input:
    path seq from sequences
    each mode from methods

    """
    t_coffee -in $seq -mode $mode
    """
}
In the above example every time a file of sequences is received as input by the process, it executes three tasks running
an alignment with a different value for the mode option. This is useful when you need to repeat the same task for a
given set of parameters.
Exercise
Extend the previous example so a task is executed for each read file matching the pattern data/ggal/*_1.fq and
repeat the same task both with salmon and kallisto .
4.3. Outputs
The output declaration block allows you to define the channels used by the process to send out the results produced.

At most one output block can be defined, and it can contain one or more output declarations. The output block follows the syntax shown below:
output:
<output qualifier> <output name> into <target channel>[,channel,..]
4.3.1. Output values

The val qualifier allows you to output a value defined in the script context. In the example below, the input value x is also emitted on the output channel receiver:

NEXTFLOW
methods = ['prot', 'dna', 'rna']

process foo {
    input:
    val x from methods

    output:
    val x into receiver

    """
    echo $x > file
    """
}

receiver.view { "Received: $it" }
4.3.2. Output files

The file qualifier allows you to send one or more files produced by the process over the specified output channel. For example:

NEXTFLOW
process randomNum {

    output:
    file 'result.txt' into numbers

    '''
    echo $RANDOM > result.txt
    '''
}

numbers.view { "Received: " + it.text }
In the above example the process randomNum creates a file named result.txt containing a random number.
Since a file parameter using the same name is declared in the output block, when the task is completed that file is sent
over the numbers channel. A downstream process declaring the same channel as input will be able to receive it.
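For example, a hypothetical downstream process could consume the numbers channel in place of the numbers.view statement above (a queue channel allows only one consumer, so it cannot do both):

```nextflow
// Receives each result.txt file emitted by randomNum and prints it
process printNum {
    echo true

    input:
    file num from numbers

    """
    cat $num
    """
}
```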
4.3.3. Multiple output files

When an output file name contains a wildcard character ( * or ? ), it is interpreted as a glob path matcher. This allows capturing multiple files into a list object and emitting them as a sole item. For example:

NEXTFLOW
process splitLetters {

    output:
    file 'chunk_*' into letters

    '''
    printf 'Hola' | split -b 1 - chunk_
    '''
}

letters
    .flatMap()
    .view { "File: ${it.name} => ${it.text}" }
It prints:

File: chunk_aa => H
File: chunk_ab => o
File: chunk_ac => l
File: chunk_ad => a
When the two-stars pattern ** is used to recurse across directories, only file paths are matched, i.e. directories are not included in the result list.
Exercise
Remove the flatMap operator and see how the output changes. The documentation for the flatMap operator is available at this link (https://www.nextflow.io/docs/latest/operator.html#flatmap).
4.3.4. Dynamic output file names

When an output file name needs to be expressed dynamically, it is possible to define it using a dynamically evaluated string which references values defined in the input declaration block or in the script global context. For example:
NEXTFLOW
process align {
    input:
    val x from species
    file seq from sequences

    output:
    file "${x}.aln" into genomes

    """
    t_coffee -in $seq > ${x}.aln
    """
}
In the above example, each time the process is executed an alignment file is produced whose name depends on the
actual value of the x input.
4.3.5. Composite inputs and outputs

When using channels emitting tuples of values, the corresponding input declaration must use a tuple qualifier, followed by the definition of each element in the tuple.

In the same manner, an output channel emitting tuples of values can be declared using the tuple qualifier, followed by the definition of each tuple element.
NEXTFLOW
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process foo {
    input:
    tuple val(sample_id), file(sample_files) from reads_ch
    output:
    tuple val(sample_id), file('sample.bam') into bam_ch
    script:
    """
    your_command_here --reads $sample_id > sample.bam
    """
}

bam_ch.view()
In previous versions of Nextflow tuple was called set, with exactly the same semantics. It can still be used for backward compatibility.
Exercise
Modify the script of the previous exercise so that the bam file is named as the given sample_id .
4.4. When
The when declaration allows you to define a condition that must be verified in order to execute the process. This can be any expression that evaluates to a boolean value.

It is useful to enable or disable the process execution depending on the state of various inputs and parameters. For example:
NEXTFLOW
params.dbtype = 'nr'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)

process find {
    input:
    file fasta from proteins
    val type from params.dbtype

    when:
    fasta.name =~ /^BB11.*/ && type == 'nr'

    script:
    """
    blastp -query $fasta -db nr
    """
}
4.5. Directives
Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantics of the task itself.

They must be entered at the top of the process body, before any other declaration blocks (i.e. input, output, etc).

Directives are commonly used to define the amount of computing resources to be used, or other meta directives that allow the definition of extra information for configuration or logging purposes. For example:
NEXTFLOW
process foo {
    cpus 2
    memory 8.GB
    container 'image/name'

    script:
    """
    your_command --this --that
    """
}
4.5.1. Exercise
Modify the script of the previous exercise adding a tag (https://www.nextflow.io/docs/latest/process.html#tag) directive logging
the sample_id in the execution output.
The pipeline result files need to be marked explicitly using the directive publishDir (https://www.nextflow.io/docs/latest/process.html#publishdir) in the process that creates them. For example:
NEXTFLOW
process makeBams {
    publishDir "/some/directory/bam_files", mode: 'copy'

    input:
    file index from index_ch
    tuple val(name), file(reads) from reads_ch

    output:
    tuple val(name), file('*.bam') into star_aligned

    """
    STAR --genomeDir $index --readFilesIn $reads
    """
}
The above example will copy all bam files created by the STAR task into the directory path /some/directory/bam_files.

The publish directory can be local or remote. For example, output files could be stored in an AWS S3 bucket (https://aws.amazon.com/s3/) simply by using the s3:// prefix in the target path.
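As a sketch, assuming valid AWS credentials are available and my-bucket is a hypothetical bucket name:

```nextflow
process foo {
    // Copy the result file directly to an S3 bucket (name is hypothetical)
    publishDir 's3://my-bucket/results', mode: 'copy'

    output:
    file 'out.txt'

    """
    echo 'some result' > out.txt
    """
}
```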
More than one publishDir directive can be used to keep different kinds of outputs in separate directories, selecting the files with the pattern option. For example:

NEXTFLOW
params.reads = 'data/reads/*_{1,2}.fq.gz'
params.outdir = 'my-results'

Channel
    .fromFilePairs(params.reads, flat: true)
    .set{ samples_ch }

process foo {
    publishDir "$params.outdir/$sampleId/", pattern: '*.fq'
    publishDir "$params.outdir/$sampleId/counts", pattern: "*_counts.txt"
    publishDir "$params.outdir/$sampleId/outlooks", pattern: '*_outlook.txt'

    input:
    set sampleId, file('sample1.fq.gz'), file('sample2.fq.gz') from samples_ch
    output:
    file "*"
    script:
    """
    < sample1.fq.gz zcat > sample1.fq
    < sample2.fq.gz zcat > sample2.fq

    awk '{s++}END{print s/4}' sample1.fq > sample1_counts.txt
    awk '{s++}END{print s/4}' sample2.fq > sample2_counts.txt

    head -n 50 sample1.fq > sample1_outlook.txt
    head -n 50 sample2.fq > sample2_outlook.txt
    """
}
The above example will create an output structure in the directory my-results that contains a separate sub-
directory for each given sample ID, each of which contains the folders counts and outlooks .
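The awk one-liner used in the script above counts FASTQ records by dividing the total number of lines by four (each record spans four lines). A quick sanity check on a toy file:

```shell
# create a toy FASTQ file containing exactly one record (4 lines)
printf '@read1\nACGT\n+\nFFFF\n' > sample1.fq

# count the records: number of lines divided by 4
awk '{s++}END{print s/4}' sample1.fq    # prints: 1
```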
5. Operators
Operators are built-in functions applied to channels.
NEXTFLOW
Channel.from(1,2,3,4)
    .map { it -> it * it }
    .view()
Filtering operators
Transforming operators
Splitting operators
Combining operators
Forking operators
Maths operators
5.2.1. view
The view operator prints the items emitted by a channel to the console standard output. For example:
NEXTFLOW
Channel
    .from('foo', 'bar', 'baz')
    .view()
It prints:
foo
bar
baz
An optional closure parameter can be specified to customize how items are printed. For example:
NEXTFLOW
Channel
    .from('foo', 'bar', 'baz')
    .view { "- $it" }
It prints:
- foo
- bar
- baz
5.2.2. map
The map operator applies a function of your choosing to every item emitted by a channel, and returns the items so
obtained as a new channel. The function applied is called the mapping function and is expressed with a closure as
shown in the example below:
NEXTFLOW
Channel
    .from( 'hello', 'world' )
    .map { it -> it.reverse() }
    .view()
A map can associate each element with a generic tuple containing any data, as needed.
NEXTFLOW
Channel
    .from( 'hello', 'world' )
    .map { word -> [word, word.size()] }
    .view { word, len -> "$word contains $len letters" }
Exercise
Use fromPath to create a channel emitting the fastq files matching the pattern data/ggal/*.fq , then chain with a
map to return a pair containing the file name and the path itself. Finally print the resulting channel.
NEXTFLOW
Channel.fromPath('data/ggal/*.fq')
    .map { file -> [ file.name, file ] }
    .view { name, file -> "> file: $name" }
5.2.3. into
The into operator connects a source channel to two or more target channels in such a way that the values emitted by the
source channel are copied to the target channels. For example:
NEXTFLOW
Channel
    .from( 'a', 'b', 'c' )
    .into { foo; bar }

foo.view { "Foo emits: " + it }
bar.view { "Bar emits: " + it }
Note the use of curly brackets and of ; as the channel name separator in this example. This is needed
because the actual parameter of into is a closure that defines the target channels to which the
source one is connected.
5.2.4. mix
The mix operator combines the items emitted by two (or more) channels into a single channel.
NEXTFLOW
c1 = Channel.from( 1,2,3 )
c2 = Channel.from( 'a','b' )
c3 = Channel.from( 'z' )

c1.mix(c2,c3).view()
1
2
a
3
b
z
The items in the resulting channel have the same order as in their respective original channels. However,
there is no guarantee that the elements of the second channel are appended after the elements of the
first. Indeed, in the above example the element a has been printed before 3 .
5.2.5. flatten
The flatten operator transforms a channel in such a way that every tuple is flattened so that each single entry is
emitted as a sole element by the resulting channel.
NEXTFLOW
foo = [1,2,3]
bar = [4,5,6]

Channel
    .from(foo, bar)
    .flatten()
    .view()
1
2
3
4
5
6
5.2.6. collect
The collect operator collects all the items emitted by a channel into a list and returns the resulting object as a single
emission.
NEXTFLOW
Channel
    .from( 1, 2, 3, 4 )
    .collect()
    .view()
[1,2,3,4]
5.2.7. groupTuple
The groupTuple operator collects tuples (or lists) of values emitted by the source channel, grouping together the
elements that share the same key. Finally, it emits a new tuple object for each distinct key collected.
NEXTFLOW
Channel
    .from( [1,'A'], [1,'B'], [2,'C'], [3,'B'], [1,'C'], [2,'A'], [3,'D'] )
    .groupTuple()
    .view()
It shows:
[1, [A, B, C]]
[2, [C, A]]
[3, [B, D]]
This operator is useful to process together all the elements that share a common property or grouping key.
Exercise
Use fromPath to create a channel emitting the fastq files matching the pattern data/ggal/*.fq , then use a map to
associate each file with its name prefix. Finally, group together all files having the same common prefix.
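One possible solution sketch is shown below; it assumes the grouping key is the part of the file name before the first _ character (adjust the split rule to your data):

```nextflow
Channel
    .fromPath('data/ggal/*.fq')
    // derive a grouping key from the file name, e.g. gut_1.fq -> gut
    .map { file -> [ file.name.split('_')[0], file ] }
    .groupTuple()
    .view()
```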
5.2.8. join
The join operator creates a channel that joins together the items emitted by two channels for which there exists a matching
key. The key is defined, by default, as the first element in each item emitted.
NEXTFLOW
left = Channel.from(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right = Channel.from(['Z', 6], ['Y', 5], ['X', 4])
left.join(right).view()
[Z, 3, 6]
[Y, 2, 5]
[X, 1, 4]
5.2.9. branch
The branch operator allows you to forward the items emitted by a source channel to one or more output channels,
choosing one of them for each item.
The selection criterion is defined by specifying a closure that provides one or more boolean expressions, each of which is
identified by a unique label. For the first expression that evaluates to true, the current item is bound to the named
channel with that label. For example:
NEXTFLOW
Channel
    .from(1,2,3,40,50)
    .branch {
        small: it < 10
        large: it > 10
    }
    .set { result }

result.small.view { "$it is small" }
result.large.view { "$it is large" }
The branch operator returns a multi-channel object, i.e. a variable that holds more than one
channel object.
GROOVY
println("Hello, World!")
The only difference between the two is that the println method implicitly appends a newline character to the
printed string.
Parentheses for function invocations are optional; therefore the following is also valid syntax:
GROOVY
println "Hello, World!"
6.2. Comments
Comments use the same syntax as in the C-family programming languages:
GROOVY
// comment a single line

/*
 a comment spanning
 multiple lines
*/
6.3. Variables
To define a variable, simply assign a value to it:
GROOVY
x = 1
println x

x = new java.util.Date()
println x

x = -3.1499392
println x

x = false
println x

x = "Hi"
println x
Local variables are defined using the def keyword:
GROOVY
def x = 'foo'
6.4. Lists
A List object can be defined by placing the list items in square brackets:
GROOVY
list = [10,20,30,40]
You can access a given item in the list with square-bracket notation (indexes start at 0 ) or using the get method:
GROOVY
assert list[0] == 10
assert list[0] == list.get(0)
In order to get the length of the list use the size method:
GROOVY
assert list.size() == 4
Lists can also be indexed with negative indexes and reversed ranges.
GROOVY
list = [0,1,2]
assert list[-1] == 2
assert list[-1..0] == list.reverse()
GROOVY
assert [1,2,3] << 1 == [1,2,3,1]
assert [1,2,3] + [1] == [1,2,3,1]
assert [1,2,3,1] - [1] == [2,3]
assert [1,2,3] * 2 == [1,2,3,1,2,3]
assert [1,[2,3]].flatten() == [1,2,3]
assert [1,2,3].reverse() == [3,2,1]
assert [1,2,3].collect{ it+3 } == [4,5,6]
assert [1,2,3,1].unique().size() == 3
assert [1,2,3,1].count(1) == 2
assert [1,2,3,4].min() == 1
assert [1,2,3,4].max() == 4
assert [1,2,3,4].sum() == 10
assert [4,2,1,3].sort() == [1,2,3,4]
assert [4,2,1,3].find{it%2 == 0} == 4
assert [4,2,1,3].findAll{it%2 == 0} == [4,2]
6.5. Maps
Maps are like lists, but with an arbitrary type of key instead of an integer index. Therefore, the syntax is very much aligned.
GROOVY
map = [a:0, b:1, c:2]
Maps can be accessed using the conventional square-bracket syntax, or as if the key were a property of the map.
GROOVY
assert map['a'] == 0
assert map.b == 1
assert map.get('c') == 2
To add data to or modify a map, the syntax is similar to adding values to a list:
GROOVY
map['a'] = 'x'
map.b = 'y'
map.put('c', 'z')
Double-quoted strings can contain the value of an arbitrary variable by prefixing its name with the $ character, or the
value of any expression by using the ${expression} syntax, similar to Bash/shell scripts:
GROOVY
foxtype = 'quick'
foxcolor = ['b', 'r', 'o', 'w', 'n']
println "The $foxtype ${foxcolor.join()} fox"

x = 'Hello'
println '$x + $y'
It prints:
The quick brown fox
$x + $y
Note the different use of $ and ${..} syntax to interpolate value expressions in a string literal.
Finally, string literals can also be defined using the / character as a delimiter. They are known as slashy strings and are
useful for defining regular expressions and patterns, as there is no need to escape backslashes. As with double-quoted
strings, they allow you to interpolate variables prefixed with a $ character.
GROOVY
x = /tic\tac\toe/
y = 'tic\tac\toe'

println x
println y
It prints:
tic\tac\toe
tic ac oe
A block of text spanning multiple lines can be defined by delimiting it with triple single or double quotes:
GROOVY
text = """
    Hello there James
    how are you today?
    """
Finally, multi-line strings can also be defined with slashy strings. For example:
GROOVY
text = /
    This is a multi-line
    slashy string!
    It's cool, isn't it?!
    /
Like before, multi-line strings inside double quotes and slash characters support variable
interpolation, while single-quoted multi-line strings do not.
6.8. If statement
The if statement uses the same syntax common to other programming languages, such as Java, C, JavaScript, etc.
GROOVY
if( < boolean expression > ) {
    // true branch
}
else {
    // false branch
}
The else branch is optional. Also, curly brackets are optional when the branch defines just a single statement.
GROOVY
x = 1
if( x > 10 )
    println 'Hello'
GROOVY
list = [1,2,3]
if( list != null && list.size() > 0 ) {
    println list
}
else {
    println 'The list is empty'
}

In Groovy, null objects, empty strings and empty collections evaluate to false, so the statement above can be written more simply as:

GROOVY
if( list )
    println list
else
    println 'The list is empty'
In some cases it can be useful to replace an if statement with a ternary expression, aka a conditional
expression. For example:
GROOVY
println list ? list : 'The list is empty'
The previous statement can be further simplified using the Elvis operator
(http://groovy-lang.org/operators.html#_elvis_operator) as shown below:
GROOVY
println list ?: 'The list is empty'
6.9. For statement
The classical for loop syntax is supported, as shown here:
GROOVY
for (int i = 0; i < 3; i++) {
    println("Hello World $i")
}
Iteration over list objects is also possible using the syntax below:
GROOVY
list = ['a','b','c']

for( String elem : list ) {
    println elem
}
6.10. Functions
It is possible to define a custom function in a script, as shown here:
GROOVY
int fib(int n) {
    return n < 2 ? 1 : fib(n-1) + fib(n-2)
}

assert fib(10) == 89
A function can take multiple arguments, separating them with a comma. The return keyword can be omitted and the
function implicitly returns the value of the last evaluated expression. Explicit types can also be omitted (though not
recommended):
GROOVY
def fact( n ) {
    n > 1 ? n * fact(n-1) : 1
}

assert fact(5) == 120
6.11. Closures
Closures are the Swiss army knife of Nextflow/Groovy programming. In a nutshell, a closure is a block of code that
can be passed as an argument to a function; it can also be thought of as an anonymous function.
More formally, closures allow the definition of functions as first-class objects.
GROOVY
square = { it * it }
The curly brackets around the expression it * it tell the script interpreter to treat this expression as code. The it
identifier is an implicit variable that represents the value passed to the function when it is invoked.
Once compiled, the function object is assigned to the variable square like any other variable assignment shown
previously. To invoke the closure, use the special method call or just use round parentheses to specify
the closure parameter(s). For example:
GROOVY
assert square.call(5) == 25
assert square(9) == 81
This is not very interesting until we find that we can pass the function square as an argument to other functions or
methods. Some built-in functions take a function like this as an argument. One example is the collect method on
lists:
GROOVY
x = [ 1, 2, 3, 4 ].collect(square)
println x
It prints:
[ 1, 4, 9, 16 ]
By default, closures take a single parameter called it . To give it a different name use the -> syntax. For example:
GROOVY
square = { num -> num * num }
The method each() , when applied to a map, can take a closure with two arguments, to which it passes the
key-value pair for each entry in the map object. For example:
GROOVY
printMap = { a, b -> println "$a with value $b" }
values = [ "Yue" : "Wu", "Mark" : "Williams", "Sudha" : "Kumari" ]
values.each(printMap)
It prints:
Yue with value Wu
Mark with value Williams
Sudha with value Kumari
A closure has two other important features. First, it can access and modify variables in the scope where it is defined.
Second, a closure can be defined in an anonymous manner, meaning that it is not given a name, and is defined in the
place where it needs to be used.
As an example showing both these features, see the following code fragment:
GROOVY
result = 0
list = [1, 2, 3]
list.each { result += it }
println result
3. Performs quantification.
NEXTFLOW
params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"

println "reads: $params.reads"
7.1.1. Exercise
Modify the script1.nf adding a fourth parameter named outdir and set it to a default path that will be used as the
pipeline output directory.
7.1.2. Exercise
Modify the script1.nf to print all the pipeline parameters by using a single log.info command and a multiline
string (https://www.nextflow.io/docs/latest/script.html#multi-line-strings) statement.
7.1.3. Recap
In this step you have learned:
5. How to use log.info to print information and save it in the log execution file
NEXTFLOW
/*
 * pipeline input parameters
 */
params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "results"

println """\
         R N A S E Q - N F   P I P E L I N E
         ===================================
         transcriptome: ${params.transcriptome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()


/*
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
process index {

    input:
    path transcriptome from params.transcriptome

    output:
    path 'index' into index_ch

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
It takes the transcriptome params file as input and creates the transcriptome index using the salmon tool.
Note how the input declaration defines a transcriptome variable in the process context, which is used in the
command script to reference that file in the Salmon command line.
The execution will fail because Salmon is not installed in your environment.
Add the command line option -with-docker to launch the execution through a Docker container, as shown below:
This time it works because it uses the Docker container nextflow/rnaseq-nf defined in the nextflow.config file.
To avoid having to add the option -with-docker every time, add the following line to the nextflow.config file:
docker.enabled = true
7.2.1. Exercise
Enable the Docker execution by default by adding the above setting to the nextflow.config file.
7.2.2. Exercise
Print the output of the index_ch channel by using the view (https://www.nextflow.io/docs/latest/operator.html#view) operator.
7.2.3. Exercise
Use the command tree work to see how Nextflow organizes the process work directory.
7.2.4. Recap
In this step you have learned:
Edit the script script3.nf and add the following statement as the last line:
read_pairs_ch.view()
The above example shows how the read_pairs_ch channel emits tuples composed of two elements, where the first is
the read pair prefix and the second is a list representing the actual files.
File paths that include one or more wildcards, i.e. * , ? , etc., MUST be wrapped in single-quote
characters to prevent Bash from expanding the glob.
7.3.1. Exercise
Use the set (https://www.nextflow.io/docs/latest/operator.html#set) operator in place of = assignment to define the
read_pairs_ch channel.
7.3.2. Exercise
Use the checkIfExists option for the fromFilePairs (https://www.nextflow.io/docs/latest/channel.html#fromfilepairs) method
to check whether the specified path contains at least one file pair.
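As a hint, the channel creation might be sketched as below (params.reads is the pattern used earlier in this tutorial):

```nextflow
Channel
    // abort with a meaningful error when no file pair matches the pattern
    .fromFilePairs(params.reads, checkIfExists: true)
    .set { read_pairs_ch }
```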
7.3.3. Recap
In this step you have learned:
In this script, note how the index_ch channel, declared as output in the index process, is now used as a channel in the
input section.
Also note how the second input is declared as a tuple composed of two elements, the pair_id and the reads , in order
to match the structure of the items emitted by the read_pairs_ch channel.
The -resume option causes the execution of any step that has already been processed to be skipped.
You will notice that the quantification process is executed more than one time.
Nextflow parallelizes the execution of your pipeline simply by providing multiple input data to your script.
7.4.1. Exercise
Add a tag (https://www.nextflow.io/docs/latest/process.html#tag) directive to the quantification process to provide a more
readable execution log.
7.4.2. Exercise
Add a publishDir (https://www.nextflow.io/docs/latest/process.html#publishdir) directive to the quantification process to
store the process results into a directory of your choice.
7.4.3. Recap
In this step you have learned:
2. How to resume the script execution skipping already computed steps
3. How to use the tag directive to provide a more readable execution output
4. How to use publishDir to store process results in a path of your choice
Channel `read_pairs_ch` has been used twice as an input by process `fastqc` and process `quantification`
7.5.1. Exercise
Modify the creation of the read_pairs_ch channel by using an into (https://www.nextflow.io/docs/latest/operator.html#into)
operator in place of a set .
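A sketch of the change (the second channel name is illustrative; create one copy per consuming process):

```nextflow
Channel
    .fromFilePairs(params.reads)
    // create two identical channels from the same source
    .into { read_pairs_ch; read_pairs2_ch }
```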
7.5.2. Recap
In this step you have learned:
1. How to use the into operator to create multiple copies of the same channel
It creates the final report in the results folder in the current work directory.
In this script, note the use of the mix (https://www.nextflow.io/docs/latest/operator.html#mix) and collect
(https://www.nextflow.io/docs/latest/operator.html#collect) operators chained together to get all the outputs of the
quantification and fastqc processes as a single input.
7.6.1. Recap
In this step you have learned:
1. How to collect many outputs to a single input with the collect operator
Note that Nextflow processes define the execution of asynchronous tasks, i.e. they are not executed one after another
in the order they are written in the pipeline script, as would happen in a common imperative programming language.
The script uses the workflow.onComplete event handler to print a confirmation message when the script completes.
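A minimal handler might be sketched as follows (the message text is illustrative):

```nextflow
workflow.onComplete {
    // workflow.success is true when all tasks completed without errors
    println( workflow.success
        ? "Done! The results are in the $params.outdir directory"
        : "Oops .. something went wrong" )
}
```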
7.8. Bonus!
Send a notification email when the workflow execution completes using the -N <email address> command line
option. Note: this requires the configuration of an SMTP server in the Nextflow config file. For the sake of this tutorial, add
the following setting to your nextflow.config file:
CONFIG
mail {
    from = 'info@nextflow.io'
    smtp.host = 'email-smtp.eu-west-1.amazonaws.com'
    smtp.port = 587
    smtp.user = "xxxxx"
    smtp.password = "yyyyy"
    smtp.auth = true
    smtp.starttls.enable = true
    smtp.starttls.required = true
}
Then execute the previous example again, specifying your email address:
For example, create a file named fastqc.sh with the following content:
BASH
#!/bin/bash
set -e
set -u

sample_id=${1}
reads=${2}

mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
Save it, give it execute permission, and move it into the bin directory as shown below:
BASH
chmod +x fastqc.sh
mkdir -p bin
mv fastqc.sh bin
Then, open the script7.nf file and replace the fastqc process' script with the following code:
NEXTFLOW
script:
"""
fastqc.sh "$sample_id" "$reads"
"""
Run it as before:
7.9.1. Recap
In this step you have learned:
2. How to avoid the use of absolute paths by keeping your scripts in the bin/ project folder
The -with-report option enables the creation of the workflow execution report. Open the file report.html with a
browser to see the report created with the above command.
The -with-trace option enables the creation of a tab-separated file containing runtime information for each executed
task. Check the content of the file trace.txt for an example.
The -with-timeline option enables the creation of the workflow timeline report showing how processes were
executed over time. This may be useful to identify the most time-consuming tasks and bottlenecks. See an example at this
link (https://www.nextflow.io/docs/latest/tracing.html#timeline-report).
Finally, the -with-dag option enables the rendering of the workflow execution directed acyclic graph representation.
Note: this feature requires the installation of Graphviz (http://www.graphviz.org/) on your computer. See here
(https://www.nextflow.io/docs/latest/tracing.html#dag-visualisation) for details.
Note: runtime metrics may be incomplete for short-running tasks, as is the case in this tutorial.
You can view the HTML files by right-clicking on the file name in the left side-bar and choosing the Preview
menu item.
This simplifies the sharing and the deployment of complex projects and tracking changes in a consistent manner.
The following GitHub repository hosts a complete version of the workflow introduced in this tutorial:
github.com/nextflow-io/rnaseq-nf
Nextflow allows the execution of a specific revision of your project by using the -r command line option. For
example:
Revisions are defined by using Git tags or branches defined in the project repository.
This allows precise control over the changes in your project files and dependencies over time.
Installing and maintaining such dependencies is a challenging task and the most common source of irreproducibility in
scientific applications.
Containers are exceptionally useful in scientific workflows. They allow the encapsulation of software dependencies, i.e.
tools and libraries required by a data analysis application in one or more self-contained, ready-to-run, immutable
container images that can be easily deployed in any platform supporting the container runtime.
A container is a ready-to-run Linux environment that can be executed in isolation from the hosting
system. It has its own copy of the file system, process space, memory management, etc.
Containers rely on Linux kernel features, such as Control Groups or cgroups (https://en.wikipedia.org/wiki/Cgroups), introduced with
kernel 2.6.
Docker adds to this concept a handy management tool to build, run and share container images.
These images can be uploaded and published in a centralised repository known as Docker Hub (https://hub.docker.com), or
hosted by other parties such as Quay (https://quay.io).
BASH
docker run <container-name>
For example:
BASH
docker run hello-world
BASH
docker pull debian:stretch-slim
BASH
docker run -it debian:stretch-slim bash
Once the container is launched, you will notice that it is running as root (!). Use the usual commands to navigate the file
system.
To exit from the container, stop the BASH session with the exit command.
Docker images are created by using a so-called Dockerfile, i.e. a simple text file containing a list of commands to be
executed to assemble and configure the image with the required software packages.
In this step you will create a Docker image containing the Salmon tool.
Warning: the Docker build process automatically copies all files located in the current directory to the Docker
daemon in order to create the image. This can take a lot of time when big or many files exist. For this reason it’s important
to always work in a directory containing only the files you really need to include in your Docker image. Alternatively,
you can use a .dockerignore file to select the paths to exclude from the build.
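For example, a minimal .dockerignore might look like the sketch below (the entries are illustrative — list whatever large paths live next to your Dockerfile):

```
# exclude bulky data and pipeline work directories from the build context
data/
work/
*.tar.gz
```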
Then use your favourite editor, e.g. vim, to create a file named Dockerfile and copy the following content:
DOCKER
FROM debian:stretch-slim

MAINTAINER <your name>

RUN apt-get update && apt-get install -y curl cowsay

ENV PATH=$PATH:/usr/games/
BASH
docker build -t my-image .
Note: don’t miss the dot in the above command. When it completes, verify that the image has been created by listing all
available images:
BASH
docker images
BASH
docker run my-image cowsay Hello Docker!
DOCKER
RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
 && mv /salmon-*/bin/* /usr/bin/ \
 && mv /salmon-*/lib/* /usr/lib/
Save the file and build the image again with the same command as before:
BASH
docker build -t my-image .
You will notice that it creates a new Docker image with the same name but with a different image ID.
Check that everything is fine running Salmon in the container as shown below:
BASH
docker run my-image salmon --version
You can even launch a container in an interactive mode by using the following command:
BASH
docker run -it my-image bash
BASH
docker run my-image \
salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
The above command fails because Salmon cannot access the input file.
This happens because the container runs in a completely separate file system and cannot access the host file system
by default.
You will need to use the --volume command line option to mount the input file(s), e.g.:
BASH
docker run --volume $PWD/data/ggal/transcriptome.fa:/transcriptome.fa my-image \
salmon index -t /transcriptome.fa -i transcript-index
the generated transcript-index directory is still not accessible in the host file system (it is
actually lost when the container exits).
An easier way is to mount a parent directory to an identical one in the container; this allows you to
use the same path when running it in the container, e.g.:
BASH
docker run --volume $HOME:$HOME --workdir $PWD my-image \
salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
BASH
ls -la transcript-index
Note that the files created by the Docker execution are owned by root .
Exercise
Use the option -u $(id -u):$(id -g) to allow Docker to create files with the right permission.
Publish your container to Docker Hub to share it with other people.
Create an account on the hub.docker.com web site. Then, from your shell terminal, run the following command, entering
the user name and password you specified when registering with the Hub:
BASH
docker login
BASH
docker tag my-image <user-name>/my-image
BASH
docker push <user-name>/my-image
BASH
docker pull <user-name>/my-image
Note how after a pull and push operation, Docker prints the container digest number e.g.
BASH
Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest
This is a unique and immutable identifier that can be used to reference a container image in a univocal manner. For
example:
BASH
docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
We’ll see later how to configure, in the Nextflow config file, which container to use, instead of having to specify it every
time as a command line argument.
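As a preview, the corresponding settings in the nextflow.config file might be sketched as follows (using the container image from this tutorial):

```
process.container = 'nextflow/rnaseq-nf'
docker.enabled = true
```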
8.2. Singularity
Singularity (http://singularity.lbl.gov) is a container runtime designed to work in HPC data centers, where the usage of Docker
is generally not allowed due to security constraints.
Singularity implements a container execution model similar to Docker, however it uses a completely different
implementation design.
A Singularity container image is archived as a plain file that can be stored in a shared file system and accessed by
many computing nodes managed by a batch scheduler.
Singularity images are created using a Singularity file, in a similar manner to Docker, though with a different syntax.
SINGULARITY
Bootstrap: docker
From: debian:stretch-slim

%environment
export PATH=$PATH:/usr/games/

%labels
AUTHOR <your name>

%post

apt-get update && apt-get install -y locales-all curl cowsay
curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
 && mv /salmon-*/bin/* /usr/bin/ \
 && mv /salmon-*/lib/* /usr/lib/
Once you have saved the Singularity file, create the image with these commands:
BASH
sudo singularity build my-image.sif Singularity
Note: the build command requires sudo permissions. A common workaround consists of building the image on a local
workstation and then deploying it to the cluster by simply copying the image file.
BASH
singularity exec my-image.sif cowsay 'Hello Singularity'
By using the shell command you can enter the container in interactive mode. For example:
BASH
singularity shell my-image.sif
BASH
touch hello.txt
ls -la
Note how the files on the host environment are shown. Singularity automatically mounts the host
$HOME directory and uses the current work directory.
BASH
singularity pull docker://debian:stretch-slim
The above command automatically downloads the Debian Docker image and converts it to a Singularity image stored in
the current directory, with a name derived from the image tag (e.g. debian_stretch-slim.sif ).
It only requires enabling the use of the Singularity engine in place of Docker, using the
-with-singularity command line option:
BASH
nextflow run script7.nf -with-singularity nextflow/rnaseq-nf
As before the Singularity container can also be provided in the Nextflow config file. We’ll see later how to do it.
In the same way that we can push docker images to Docker Hub, we can upload Singularity images to the Singularity
Library.
A Conda environment is defined using a YAML file which lists the required software packages. For example:
YAML
name: nf-tutorial
channels:
- defaults
- bioconda
- conda-forge
dependencies:
- salmon=1.0.0
- fastqc=0.11.5
- multiqc=1.5
Given the recipe file, the environment is created using the command shown below:
BASH
conda env create --file env.yml
You can check the environment was created successfully with the command shown below:
BASH
conda env list
BASH
conda activate nf-tutorial
Nextflow is able to manage the activation of a Conda environment when its directory is specified using the -with-
conda option. For example:
BASH
nextflow run script7.nf -with-conda /home/ubuntu/miniconda2/envs/nf-tutorial
When a Conda environment is specified as a YAML recipe file, Nextflow automatically downloads the
required dependencies, builds the environment and activates it.
This makes it easier to manage different environments for the processes in the workflow script.
8.4. BioContainers
Another useful resource linking together Bioconda and containers is the BioContainers (https://biocontainers.pro) project.
BioContainers is a community initiative that provides a registry of container images for every Bioconda recipe.
9. Nextflow configuration
A key Nextflow feature is the ability to decouple the workflow implementation from the configuration settings required by
the underlying execution platform.
This enables portable deployment without the need to modify the application code.
When more than one of the above files exists, they are merged, so that the settings in the first override the same settings
that may appear in the second one, and so on.
The default config file search mechanism can be extended by providing an extra configuration file using the command
line option -c <config file> .
name = value
Please note, string values need to be wrapped in quotation characters while numbers and boolean
values ( true , false ) do not. Also note that values are typed, meaning for example that 1 is
different from '1' : the first is interpreted as the number one, while the latter is interpreted as
a string value.
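For instance, a hypothetical nextflow.config fragment mixing the three value types (the parameter names are illustrative only):
CONFIG
1 params.threads = 4        // a number
2 params.tag     = '4'      // a string, not the same value as the number 4
3 params.verbose = false    // a boolean, not quoted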
CONFIG
1 propertyOne = 'world'
2 anotherProp = "Hello $propertyOne"
3 customPath = "$PATH:/my/app/folder"
In the configuration file it’s possible to access any variable defined in the host environment such as
$PATH , $HOME , $PWD , etc.
NEXTFLOW
1 // a single line comment
2
3 /*
4 a comment spanning
5 multiple lines
6 */
CONFIG
1 alpha.x = 1
2 alpha.y = 'string value..'
3
4 beta {
5 p = 2
6 q = 'another string ..'
7 }
CONFIG
1 // config file
2 params.foo = 'Bonjour'
3 params.bar = 'le monde!'
NEXTFLOW
1 // workflow script
2 params.foo = 'Hello'
3 params.bar = 'world!'
4
5 // print the both params
6 println "$params.foo $params.bar"
Exercise
Save the first snippet as nextflow.config and the second one as params.nf . Then run:
CMD
nextflow run params.nf
CMD
nextflow run params.nf --foo Hola
CONFIG
1 env.ALPHA = 'some value'
2 env.BETA = "$HOME/some/path"
Exercise
Save the above snippet in a file named my-env.config . Then save the snippet below in a file named foo.nf :
NEXTFLOW
1 process foo {
2 echo true
3 '''
4 env | egrep 'ALPHA|BETA'
5 '''
6 }
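Finally, run the script providing the config file with the -c option (file names as in the exercise above):
CMD
nextflow run foo.nf -c my-env.config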
However it’s always a good practice to decouple the workflow execution logic from the process configuration settings,
i.e. it’s strongly suggested to define the process settings in the workflow configuration file instead of the workflow
script.
The process configuration scope allows the setting of any process directives
(https://www.nextflow.io/docs/latest/process.html#directives) in the Nextflow configuration file. For example:
CONFIG
1 process {
2 cpus = 10
3 memory = 8.GB
4 container = 'biocontainers/bamtools:v2.4.0_cv3'
5 }
The above config snippet defines the cpus , memory and container directives for all processes in your workflow
script.
Memory and time duration units can be specified either using a string-based notation, in which the
digit(s) and the unit are separated by a blank and wrapped in quote characters (e.g. '8 GB' ), or using
the numeric notation, in which the digit(s) and the unit are separated by a dot character and not
enclosed by quote characters (e.g. 8.GB ).
The syntax for setting process directives in the configuration file requires the = (assignment) operator,
whereas it must be omitted when setting process directives in the workflow script.
This is especially important when you want to define a config setting using a dynamic expression with a closure. For
example:
process {
memory = { 4.GB * task.cpus }
}
Directives that require more than one value, e.g. pod (https://www.nextflow.io/docs/latest/process.html#pod), need to be
expressed as a map object in the configuration file.
process {
pod = [env: 'FOO', value: '123']
}
Finally, directives that can be repeated in the process definition need to be defined as a
list object in the configuration file. For example:
process {
pod = [ [env: 'FOO', value: '123'],
[env: 'BAR', value: '456'] ]
}
CONFIG
1 process.container = 'nextflow/rnaseq-nf'
2 docker.enabled = true
The use of the unique SHA256 image ID guarantees that the image content does not change over time
CONFIG
1 process.container = 'nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab
2 docker.enabled = true
CONFIG
1 process.container = '/some/singularity/image.sif'
2 singularity.enabled = true
The container image file must be an absolute path i.e. it must start with a / .
library:// download the container image from the Singularity Library service (https://cloud.sylabs.io/library).
shub:// download the container image from the Singularity Hub (https://singularity-hub.org/).
docker:// download the container image from the Docker Hub (https://hub.docker.com/) and convert it to the
Singularity format.
docker-daemon:// pull the container image from a local Docker installation and convert it to a Singularity image
file.
When a plain Docker container image name is specified and Singularity execution is enabled, Nextflow
implicitly downloads the image and converts it to the Singularity format. For example:
CONFIG
1 process.container = 'nextflow/rnaseq-nf'
2 singularity.enabled = true
The above configuration instructs Nextflow to use Singularity engine to run your script processes. The container is
pulled from the Docker registry and cached in the current directory to be used for further runs.
Alternatively, if you have a Singularity image file, its absolute path can be specified as the container name,
either using the -with-singularity option or the process.container setting in the config file.
BASH
nextflow run script7.nf
Note: Nextflow will pull the container image automatically; it will require a few seconds depending on the network
connection speed.
CONFIG
1 process.conda = "/home/ubuntu/miniconda2/envs/nf-tutorial"
You can either specify the path of an existing Conda environment directory or the path of a Conda environment YAML
file.
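For example, pointing the same setting at a recipe file instead (the path is illustrative):
CONFIG
1 process.conda = '/some/path/env.yml'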
Nextflow has built-in support for the most commonly used batch schedulers, such as Univa Grid Engine, SLURM
(https://slurm.schedmd.com/) and IBM LSF, among others. Check the Nextflow documentation for the complete list of
supported execution platforms (https://www.nextflow.io/docs/latest/executor.html).
To run your pipeline with a batch scheduler, modify the nextflow.config file specifying the target executor and, if
needed, the required computing resources. For example:
CONFIG
1 process.executor = 'slurm'
When using a batch scheduler it is generally necessary to specify the amount of resources, i.e. cpus, memory, execution time,
etc., required by each task.
CONFIG
1 process {
2 executor = 'slurm'
3 queue = 'short'
4 memory = '10 GB'
5 time = '30 min'
6 cpus = 4
7 }
CONFIG
1 process {
2 executor = 'slurm'
3 queue = 'short'
4 memory = '10 GB'
5 time = '30 min'
6 cpus = 4
7
8 withName: foo {
9 cpus = 4
10 memory = '20 GB'
11 queue = 'short'
12 }
13
14 withName: bar {
15 cpus = 8
16 memory = '32 GB'
17 queue = 'long'
18 }
19 }
When a workflow application is composed of many processes, it can be overkill to list all the process names in the
configuration file to specify the resources for each of them.
NEXTFLOW
1 process task1 {
2 label 'long'
3
4 """
5 first_command --here
6 """
7 }
8
9 process task2 {
10 label 'short'
11
12 """
13 second_command --here
14 """
15 }
16
CONFIG
1 process {
2 executor = 'slurm'
3
4 withLabel: 'short' {
5 cpus = 4
6 memory = '20 GB'
7 queue = 'alpha'
8 }
9
10 withLabel: 'long' {
11 cpus = 8
12 memory = '32 GB'
13 queue = 'omega'
14 }
15 }
CONFIG
1 process {
2 withName: foo {
3 container = 'some/image:x'
4 }
5 withName: bar {
6 container = 'other/image:y'
7 }
8 }
9
10 docker.enabled = true
A single fat container or many slim containers? Both approaches have pros & cons. A single
container is simpler to build and to maintain, however when using many tools the image can
become very big and tools can conflict with each other. Using a container for each process can result in
many different images to build and to maintain, especially when the processes in your workflow use
different tools in each task.
Configuration profiles are defined by using the special scope profiles which group the attributes that belong to the
same profile using a common prefix. For example:
CONFIG
1 profiles {
2
3 standard {
4 params.genome = '/local/path/ref.fasta'
5 process.executor = 'local'
6 }
7
8 cluster {
9 params.genome = '/data/shared/ref.fasta'
10 process.executor = 'sge'
11 process.queue = 'long'
12 process.memory = '10GB'
13 process.conda = '/some/path/env.yml'
14 }
15
16 cloud {
17 params.genome = '/data/shared/ref.fasta'
18 process.executor = 'awsbatch'
19 process.container = 'cbcrg/imagex'
20 docker.enabled = true
21 }
22
23 }
This configuration defines three different profiles: standard , cluster and cloud that set different process
configuration strategies depending on the target runtime platform. By convention the standard profile is implicitly
used when no other profile is specified by the user.
To enable a specific profile use -profile option followed by the profile name:
CMD
nextflow run <your script> -profile cluster
Two or more configuration profiles can be specified by separating the profile names with a comma
character:
CMD
nextflow run <your script> -profile standard,cloud
AWS Batch (https://aws.amazon.com/batch/) is a managed computing service that allows the execution of containerised
workloads in the Amazon cloud infrastructure.
Nextflow provides built-in support for AWS Batch, which allows the seamless deployment of a Nextflow pipeline in
the cloud, offloading the process executions as Batch jobs.
Once the Batch environment is configured, specifying the instance types to be used and the max number of cpus to be
allocated, you need to create a Nextflow configuration file like the one shown below:
CONFIG
1 process.executor = 'awsbatch' 1
2 process.queue = 'nextflow-ci' 2
3 process.container = 'nextflow/rnaseq-nf:latest' 3
4 workDir = 's3://nextflow-ci/work/' 4
5 aws.region = 'eu-west-1' 5
6 aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws' 6
1 Set the AWS Batch as the executor to run the processes in the workflow
2 The name of the computing queue defined in the Batch environment
3 The Docker container image to be used to run each job
4 The workflow work directory must be an AWS S3 bucket
5 The AWS region to be used
6 The path of the AWS cli tool required to download/upload files to/from the container
The best practice is to keep these settings as a separate profile in your workflow config file. This allows
execution with a simple command.
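For instance, a sketch of such a profile wrapping the settings above (the profile name batch is illustrative):
CONFIG
1 profiles {
2     batch {
3         process.executor = 'awsbatch'
4         process.queue = 'nextflow-ci'
5         process.container = 'nextflow/rnaseq-nf:latest'
6         workDir = 's3://nextflow-ci/work/'
7         aws.region = 'eu-west-1'
8     }
9 }
The pipeline can then be deployed with nextflow run <your script> -profile batch .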
The complete details about AWS Batch deployment are available at this link
(https://www.nextflow.io/docs/latest/awscloud.html#aws-batch).
aws {
batch {
volumes = '/some/path'
}
}
Multiple volumes can be specified using comma-separated paths. The usual Docker volume mount syntax can be used
to define complex volumes for which the container path is different from the host path, or to specify a read-only
option:
aws {
region = 'eu-west-1'
batch {
volumes = ['/tmp', '/host/path:/mnt/path:ro']
}
}
IMPORTANT:
This is a global configuration that has to be specified in a Nextflow config file and, as such, it is applied to all process
executions.
Nextflow expects those paths to be available. It does not handle the provisioning of EBS volumes or other kinds of
storage.
However, you may still need to specify a custom Job Definition to provide fine-grained control of the configuration
settings of a specific job e.g. to define custom mount paths or other special settings of a Batch Job.
To use your own job definition in a Nextflow workflow, use it in place of the container image name, prefixing it with
the job-definition:// string. For example:
process {
container = 'job-definition://your-job-definition-name'
}
When creating your custom AMI for AWS Batch, make sure to use the Amazon ECS-Optimized
Amazon Linux AMI as the base image.
The following snippet shows how to install AWS CLI with Miniconda:
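The snippet below sketches that installation; the commands mirror those used in the launch template shown later in this section:
BASH
wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $HOME/miniconda
$HOME/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh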
The aws tool will be placed in a directory named bin in the main installation folder. Modifying this
directory structure after the installation will cause the tool to not work properly.
Finally, specify the aws full path in the Nextflow config file as shown below:
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
In the EC2 dashboard create a Launch template specifying in the user data field:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"
--//
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/sh
## install required deps
set -x
export PATH=/usr/local/bin:$PATH
yum install -y jq python27-pip sed wget bzip2
pip install -U boto3
## install awscli
USER=/home/ec2-user
wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $USER/miniconda
$USER/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
chown -R ec2-user:ec2-user $USER/miniconda
--//--
Then in the Batch dashboard create a new compute environment and specify the newly created launch template in the
corresponding field.
CONFIG
1 process {
2 executor = 'slurm' 1
3 queue = 'short' 2
4
5 withLabel: bigTask { 3
6 executor = 'awsbatch' 4
7 queue = 'my-batch-queue' 5
8 container = 'my/image:tag' 6
9 }
10 }
11
12 aws {
13 region = 'eu-west-1' 7
14 }
1 Set slurm as the default executor
2 Set the default queue for the SLURM cluster
3 Settings for the processes annotated with the bigTask label
4 Set awsbatch as the executor for the bigTask processes
5 Set the Batch queue for the bigTask processes
6 Set the container image used to run the bigTask processes
7 Define the region for the Batch execution
The task unique ID is generated as a 128-bit hash number obtained by composing the task input values, input files and the
command string.
work/
├── 12
│ └── 1adacb582d2198cd32db0e6f808bce
│ ├── genome.fa -> /data/../genome.fa
│ └── index
│ ├── hash.bin
│ ├── header.json
│ ├── indexing.log
│ ├── quasi_index.log
│ ├── refInfo.json
│ ├── rsd.bin
│ ├── sa.bin
│ ├── txpInfo.bin
│ └── versionInfo.json
├── 19
│ └── 663679d1d87bfeafacf30c1deaf81b
│ ├── ggal_gut
│ │ ├── aux_info
│ │ │ ├── ambig_info.tsv
│ │ │ ├── expected_bias.gz
│ │ │ ├── fld.gz
│ │ │ ├── meta_info.json
│ │ │ ├── observed_bias.gz
│ │ │ └── observed_bias_3p.gz
│ │ ├── cmd_info.json
│ │ ├── libParams
│ │ │ └── flenDist.txt
│ │ ├── lib_format_counts.json
│ │ ├── logs
│ │ │ └── salmon_quant.log
│ │ └── quant.sf
│ ├── ggal_gut_1.fq -> /data/../ggal_gut_1.fq
│ ├── ggal_gut_2.fq -> /data/../ggal_gut_2.fq
│ └── index -> /data/../asciidocs/day2/work/12/1adacb582d2198cd32db0e6f808bce/index
In practical terms, the pipeline is executed from the beginning; however, before launching the execution of a process,
Nextflow uses the task unique ID to check whether the work directory already exists and contains a valid command exit
status and the expected output files.
If this condition is satisfied the task execution is skipped and previously computed results are used as the process
results.
The first task, for which a new output is computed, invalidates all downstream executions in the remaining DAG.
The workflow final outputs are supposed to be stored in a different location, specified using one or more
publishDir (https://www.nextflow.io/docs/latest/process.html#publishdir) directives.
A different location for the execution work directory can be specified using the command line option -w e.g.
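For example (the path is illustrative):
CMD
nextflow run <your script> -w /some/scratch/dir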
Note that deleting or moving the pipeline work directory will prevent the use of the resume feature in following
runs.
Therefore, just touching a file will invalidate the related task execution.
Note that in the same experiment the same pipeline can be executed multiple times, however you should avoid
launching two (or more) Nextflow instances in the same directory concurrently.
The nextflow log command lists the executions run in the current folder:
BASH
1 $ nextflow log
2
3 TIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID
4 2019-05-06 12:07:32 1.2s focused_carson ERR a9012339ce 7363b3f0-09ac-495b-a947-28cf430d0b85
5 2019-05-06 12:08:33 21.1s mighty_boyd OK a9012339ce 7363b3f0-09ac-495b-a947-28cf430d0b85
6 2019-05-06 12:31:15 1.2s insane_celsius ERR b9aefc67b4 4dc656d2-c410-44c8-bc32-7dd0ea87bebf
7 2019-05-06 12:31:24 17s stupefied_euclid OK b9aefc67b4 4dc656d2-c410-44c8-bc32-7dd0ea87bebf
You can use either the session ID or the run name to recover a specific execution. For example:
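For instance, using one of the run names from the listing above (any run name or session ID from your own log works):
CMD
nextflow log mighty_boyd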
By default, it lists the work directories used to compute each task. For example:
/data/.../work/7b/3753ff13b1fa5348d2d9b6f512153a
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
/data/.../work/82/ba67e3175bd9e6479d4310e5a92f99
/data/.../work/e5/2816b9d4e7b402bfdd6597c2c2403d
/data/.../work/3b/3485d00b0115f89e4c202eacf82eba
Using the option -f (fields) it’s possible to specify which metadata should be printed by the log command. For
example:
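A sketch of the command (the run name and the field list are illustrative):
CMD
nextflow log mighty_boyd -f 'process,exit,hash,duration'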
The complete list of available fields can be retrieved with the command:
nextflow log -l
The option -F allows the specification of a filtering criteria to print only a subset of tasks. For example:
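For instance, to select only the tasks of a hypothetical fastqc process:
CMD
nextflow log mighty_boyd -F 'process =~ /fastqc/'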
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
Finally, the -t option allows the creation of a basic custom provenance report by providing a template file, in any format of
your choice. For example:
HTML
<div>
<h2>${name}</h2>
<div>
Script:
<pre>${script}</pre>
</div>
<ul>
<li>Exit: ${exit}</li>
<li>Status: ${status}</li>
<li>Work dir: ${workdir}</li>
<li>Container: ${container}</li>
</ul>
</div>
Save the above snippet in a file named template.html . Then run this command:
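A sketch of the command (the run name is illustrative; the report is written to the redirected file):
CMD
nextflow log mighty_boyd -t template.html > provenance.html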
Input file changed: Make sure that there is no change in your input files. Don’t forget that the task unique hash is computed
taking into account the complete file path, the last modified timestamp and the file size. If any of this information
changes, the workflow will be re-executed even if the input content is the same.
A process modifies an input: A process should never alter input files, otherwise the resume, for future executions,
will be invalidated for the same reason explained in the previous point.
Inconsistent file attributes: Some shared file systems, such as NFS (https://en.wikipedia.org/wiki/Network_File_System),
may report an inconsistent file timestamp, i.e. a different timestamp for the same file, even if it has not been modified. To
prevent this problem use the lenient cache strategy (https://www.nextflow.io/docs/latest/process.html#cache).
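The strategy can be set per process with the cache directive or globally in the config file, e.g.:
CONFIG
1 process.cache = 'lenient'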
Race condition in a global variable: Nextflow is designed to simplify parallel programming without having to take care
of race conditions and access to shared resources. One of the few cases in which a race condition can arise is
when a global variable is used by two (or more) operators. For example:
NEXTFLOW
1 Channel
2 .from(1,2,3)
3 .map { it -> X=it; X+=2 }
4 .view { "ch1 = $it" }
5
6 Channel
7 .from(1,2,3)
8 .map { it -> X=it; X*=2 }
9 .view { "ch2 = $it" }
The problem in this snippet is that the X variable in the closure definition is defined in the global scope. Therefore,
since operators are executed in parallel, the X value can be overwritten by the other map invocation.
The correct implementation requires the use of the def keyword to declare the variable local.
NEXTFLOW
1 Channel
2 .from(1,2,3)
3 .map { it -> def X=it; X+=2 }
4 .println { "ch1 = $it" }
5
6 Channel
7 .from(1,2,3)
8 .map { it -> def X=it; X*=2 }
9 .println { "ch2 = $it" }
Non-deterministic input channels: While dataflow channel ordering is guaranteed, i.e. data is read in the same
order in which it is written into the channel, when a process declares two or more channels as input, each of which is
the output of a different process, the overall input ordering is not consistent across different executions.
The inputs of the downstream process can be delivered in any order because the execution order of the processes foo and
bar is not deterministic due to their parallel execution.
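The example referred to above is not reproduced here; the sketch below shows the problematic pattern (process and channel names are assumed):
NEXTFLOW
1 // bam_ch and bai_ch are the outputs of the processes foo and bar
2 process gather {
3     input:
4     set val(pair), file(bam) from bam_ch
5     set val(pair2), file(bai) from bai_ch  // may refer to a different pair!
6     """
7     merge_command $bam $bai
8     """
9 }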
Therefore the input of the third process needs to be synchronized using the join
(https://www.nextflow.io/docs/latest/operator.html#join) operator or a similar approach. The third process should be
written as:
NEXTFLOW
1 ...
2
3 process gather {
4 input:
5 set val(pair), file(bam), file(bai) from bam_ch.join(bai_ch)
6 """
7 merge_command $bam $bai
8 """
9 }
When a process execution exits with a non-zero exit status, Nextflow stops the workflow execution and reports the
failing task:
CMD
ERROR ~ Error executing process > 'index'
Caused by: 1
Command executed: 2
Command exit status: 3
127
Command output: 4
(empty)
Command error: 5
Work dir: 6
/Users/pditommaso/work/0b/b59f362980defd7376ee0a75b41f62
1 The cause of the error
2 The command executed
3 The command exit status
4 The command standard output, when available
5 The command standard error
6 The command work directory
Review all of this data carefully; it can provide valuable information on the cause of the error.
If this is not enough, change into the task work directory. It contains all the files needed to replicate the issue in an isolated
manner.
Verify that the .command.sh file contains the expected command to be executed and all variables are correctly
resolved.
Also verify the existence of the .exitcode file. If it is missing and the .command.begin file does not exist either, the task
was never executed by the subsystem (e.g. the batch scheduler). If the .command.begin file exists, the job was launched
but was likely killed abruptly.
You can replicate the failing execution using the command bash .command.run and verify the cause of the error.
NEXTFLOW
1 process foo {
2 errorStrategy 'ignore'
3 script:
4 """
5 your_command --this --that
6 """
7 }
If you want to ignore any error, set the same directive in the config file as the default setting:
CONFIG
1 process.errorStrategy = 'ignore'
NEXTFLOW
1 process foo {
2 errorStrategy 'retry'
3 script:
4 """
5 your_command --this --that
6 """
7 }
Using the retry error strategy, a task that returns a non-zero exit status is re-executed a second time before
the complete workflow execution is stopped.
NEXTFLOW
1 process foo {
2 errorStrategy { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
3 maxRetries 5
4 script:
5 '''
6 your_command --here
7 '''
8 }
To handle this use case you can use a retry error strategy, increasing the computing resources allocated to the
job at each successive attempt.
NEXTFLOW
1 process foo {
2 cpus 4
3 memory { 2.GB * task.attempt } 1
4 time { 1.hour * task.attempt } 2
5 errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' } 3
6 maxRetries 3 4
7
8 script:
9 """
10 your_command --cpus $task.cpus --mem $task.memory
11 """
12 }
1 The memory is defined in a dynamic manner: the first attempt is 2 GB, the second 4 GB, and so on.
2 The wall execution time is set dynamically as well: the first execution attempt is set to 1 hour, the second to 2
hours, and so on.
3 If the task returns an exit status equal to 140, the error strategy is set to retry , otherwise the execution
terminates.
4 The process execution can be retried up to three times.