The Art of Bioinformatics Scripting
Contents

1 Welcome to scripting
  1.1 The Biostar Handbook Collection
  1.2 High resolution images
  1.3 Is there an “art” to scripting?
  1.4 How to download the book?
  1.5 Typesetting conventions
  1.6 How was the book developed?

II LEARN SCRIPTING

7 Annoying problems
  7.1 Learn to visualize “whitespace”
  7.2 Editor requirements
  7.3 Make sure the script is in Unix format!
  7.4 Use Notepad++ or EditPlus on Windows

11 Terminal multiplexing
  11.1 How do I use terminal multiplexing?
  11.2 What does a terminal multiplexer do?
  11.3 How to create multiple windows in a session?
  11.4 Minimalist tmux survival guide
  11.5 Where can I learn more?

IV ADVANCED CONCEPTS

12 Advanced shell concepts
  12.1 Self documenting bash programs
  12.2 Better output redirection
  12.3 Dealing with the copious output of tools
  12.4 What does matching pattern do?
  12.5 How can I manipulate file paths in bash?

V APPENDIX

14 Appendix Notes
Chapter 1

Welcome to scripting
For a beginner, all automated processes look like magic. Yet, just like magic, when practiced haphazardly, these programs can blow up in our faces - sometimes in the most devious manners:

• A Code Glitch May Have Caused Errors In More Than 100 Published Studies
In the story above, the critical error was relying on the sort order that the operating system produces when asked to list files in a directory. This flawed approach of default sorting combined with pattern matching is widespread in bioinformatics as well; an unreliable piece of code might look like this:
# This is a counterexample! Do not use it!
for name in *.fq; do
echo "runtool $name > $name.txt"
done
The code is suboptimal because it runs on whatever files match the pattern. This behaviour seems convenient at first, and perhaps involves less typing, yet constructs designed this way frequently lead to ever-increasing complications and unexpected problems.
The chapter How to write better loops shows a more robust approach.
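The gist of that more robust style can already be sketched here: build an explicit, reviewable list of inputs first, then loop over the list rather than over a glob (the file names below are hypothetical):

```shell
# Sketch: drive the loop from an explicit list of inputs instead of
# whatever a glob happens to match at run time.
printf 'SRR100_1.fq\nSRR100_2.fq\n' > samples.txt

while read name; do
    echo "runtool $name > $name.txt"
done < samples.txt
```

Because the list lives in a file, you can inspect it, sort it, and version it before any tool runs.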
Part I
Chapter 2

What are shell scripts

A bash script is a plain text file executed by the bash shell (one dialect of shell). Here is what a script might look like in your editor.
Bash scripts are usually relatively short text files (10-100 lines) that contain a series of commands you could also type into the terminal. Conventionally, the shell file extension is .sh (naming it so will allow your editor to color it as seen above), and a script is run through the shell either with:

bash myscript.sh

or, once the file is made executable, even as:

./myscript.sh
Recommended editors:

• Notepad++ for Windows (recommended editor on Windows)
• Komodo Edit for Mac, Windows, Linux (used to be good but it is getting too complicated)
• Sublime Text for Mac, Windows, Linux

Programming editors:

• PyCharm for Mac, Windows, Linux (recommended for Python programmers; I use this tool for both Python programming and text editing)
• Visual Studio Code for Mac, Windows, Linux (excellent for coding as well)
Without the .sh extension, your editor will not recognize the program as a script and won’t syntax highlight it. Keep the extension on your scripts!
In the book we use the explicit notation of executing bash scripts with:
bash myscript.sh
We recommend that you do that too.
to press tab, right?). Never type out any filename fully from the keyboard; always autocomplete! This habit ensures that you are using the correct names. If you are not tabbing regularly, get into the habit of doing so! Within a script, TABs will not autocomplete.
One particularly annoying interactive behavior is the special meaning of the ! character. When you type ! into an interactive bash session, it is interpreted as an operator that reruns a previous command by matching it. It has many behaviors; for example, if you first typed:

foo
bar

then later you could type:

!f

to rerun foo. This is called event designation.

It is an outdated “feature” that runs commands without showing you what they will do. Don’t use it. It goes against the principles of being explicit and in full control. If you need a previous command, press the up arrow key to scroll through your history.
The most annoying side effect of history expansion is that some strings that work in a script will not work when typed into the shell. For example:

echo "Hello World!"

will work inside a script but will fail when you type it into the terminal, because there the shell treats the ! as an event designator. Oh well…
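If you do want the exclamation mark in an interactive session, there are two standard workarounds:

```shell
# 1. Single quotes suppress history expansion entirely:
echo 'Hello World!'

# 2. Or switch history expansion off for the whole session:
set +H
echo "Hello World!"
```

Both print the line as intended; set +H simply disables the event-designator machinery for the current shell.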
Chapter 3

Writing better scripts
NAME=John
echo Hello ${NAME}!
bash sayhello.sh
Putting the variables first allows you to change them quickly, making your code more adaptable. Let’s revisit the code we showed before:

Move the variable sections to the start; our script now looks like so:
Note how nice and airy this code is. One comment per action. One empty line after each action. Imagine a fresh breeze bathing each line of code, keeping it from rotting. You know scripting when you derive joy not just from getting the job done, but from writing simple, sturdy, and pretty lines of code.
3.3 Why is it right to stop a script when an error occurs?
set -uex
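To recap what these three flags do, here is a minimal, self-contained demonstration (the script body is made up for illustration):

```shell
#!/bin/bash
# -u : treat the use of unset variables as an error
# -e : exit immediately when any command fails
# -x : print each command before it is executed
set -uex

NAME=World
echo "Hello ${NAME}"
# Referencing an unset variable, e.g. ${TYPO}, would now stop the script
# instead of silently expanding to an empty string.
```

With these settings a script fails fast and loudly, which is exactly what you want when a long pipeline goes wrong.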
4.3 What will bad code look like?
Above we had to use a bash substitution operator %.*. Yuck! The code now
produces:
cutadapt -l 20 SRR1553607_1.fastq -o SRR1553607_1.trimmed.fq
cutadapt -l 20 SRR1553607_2.fastq -o SRR1553607_2.trimmed.fq
cutadapt -l 20 SRR1972917_1.fastq -o SRR1972917_1.trimmed.fq
cutadapt -l 20 SRR1972917_2.fastq -o SRR1972917_2.trimmed.fq
The code is more complicated, less readable, and we still haven’t fixed the problem of not knowing either the order or the files that we operate on.
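For reference, a loop of this shape would produce the output shown above (a reconstruction; the original script is not reproduced in this excerpt):

```shell
# ${name%.*} chops the rightmost extension:
# SRR1553607_1.fastq -> SRR1553607_1
for name in SRR1553607_1.fastq SRR1553607_2.fastq SRR1972917_1.fastq SRR1972917_2.fastq; do
    echo "cutadapt -l 20 $name -o ${name%.*}.trimmed.fq"
done
```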
paradigm that will make you smarter. The challenge of descriptive programming is that it forces you to solve the whole problem at once, rather than isolating it into little pieces.

To use GNU parallel, you have to understand what it does. Here are the essential tips to get started.
Suppose you have a file called ids.txt that contains
A
B
C
Suppose you wanted to generate this output:
Hello A
Hello B
Hello C
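The command that produces this output is the canonical substitution pattern: each input line becomes one value of the {} placeholder. The -k flag, an addition here, keeps the output in input order, which parallel does not otherwise guarantee:

```shell
# Create the input file, then substitute each line into {}.
printf 'A\nB\nC\n' > ids.txt

cat ids.txt | parallel -k echo Hello {}
```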
Often information has multiple facets: samples, replicates, genders, etc. GNU parallel has elegant methods to combine them all. Perhaps you want all possible combinations:
parallel echo Hello {} and {} ::: A B ::: 1 2
Note how both parameters get substituted in each location:
Hello A 1 and A 1
Hello A 2 and A 2
Hello B 1 and B 1
Hello B 2 and B 2
Use the numbered construct {1} if you want to separate the inputs:
parallel echo Hello {1} and {2} ::: A B ::: 1 2
will print:
Hello A and 1
Hello A and 2
Hello B and 1
Hello B and 2
You can take inputs one from each source with --link:
parallel --link echo Hello {1} and {2} ::: A B ::: 1 2
will print:
Hello A and 1
Hello B and 2
What this means is that, when you have a problem, explain the problem to the ducky - with real words, as if you were actually talking to the ducky. Have the ducky ask very simple followup questions like: “What could cause this?”, “How do you know it does not work?”, “Why does it print that?”

Chances are that the act of verbalizing the problem will make you understand the problem better, and with that it often helps you solve it.

If you feel embarrassed about the ducky, replace the rubber duck with a friend or significant other. It seems to work best when the other person knows nothing about the problem - it removes their “responsibility” of having to solve the problem.

or in a week, and see that you can explain to the rubber ducky the same thing again. If not, rinse and repeat.
Suppose there is a file called ids.txt that contains
A
B
C
The poor, confused little rubber ducky asks you kindly: “Your Majesty, can you please explain to me what the code below does?” Let’s help the ducky out!
cat ids.txt | parallel echo {} {} {}
5.1 GNU parallel - parallelize serial command line programs without changing them

• http://www.gnu.org/software/parallel/
All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable.

GNU Parallel is a general parallelizer that makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
5.2 Installation
A personal installation does not require root access. It can be done in 10
seconds by doing this:
# Install GNU parallel.
conda install parallel
Editor’s note: to get help on the various features of the tool see:

• man parallel
• GNU Parallel online documentation
Now we want to run the experiment for every combination of age 1..80, sex M/F, and chromosome 1..22 plus X and Y:

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y
But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir:

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
This will create files like outputdir/1/80/2/M/3/X/stdout containing the
standard output of the job.
If you have many different parameters it may be handy to name them:

parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout
If you want the output in a CSV/TSV file that you can read into R or LibreOffice Calc, simply point --results to a file name ending in .csv/.tsv:

parallel --results output.tsv --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y
It will deal correctly with newlines in the output, so they will be read as
newlines in R or LibreOffice Calc.
If one of your parameters takes on many different values, these can be read from a file using the :::: operator:

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y
If you have many experiments, it can be useful to see some experiments picked at random. Think of it as painting a picture by numbers: you can start from the top corner, or you can paint bits at random. If you paint bits at random, you will often see a pattern earlier than if you painted in a structured way.
With --shuf GNU Parallel will shuffle the experiments and run them all, but in random order:

parallel --shuf --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y
Part II

LEARN SCRIPTING
Chapter 6

How to learn scripting

When you start out, errors will abound. That’s ok; we make errors all the time as well. This page lists strategies you can employ to move forward faster.
Often you are absolutely certain that you did everything right, yet
you are still getting the error. In those cases the answer is that you
only think you are doing it the way you are supposed to.
Surprisingly often, your eyes and brain will conspire against you; they might not let you see that tiny mistake you made. It happens to me all the time. Sometimes I see only what I want to see. In those cases the best course of action is to clear your view:
clear
This clears the terminal - and your mind, and with it all preconceptions you might have about what the command should look like. Now retype everything from the very beginning! Build your commands out one piece at a time; keep pressing TAB and make use of autocomplete.
+ wc -c
14
Note the differences in output relative to the original script:

• The $NAME parameter was substituted.
• Actions connected with the pipe | are listed as separate commands.
• The redirection into > foo.txt is not shown.
• You can see the final output of the program: 14.
Tracing scripts in your terminal:
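For example, a trace like the one above can be reproduced with the -x flag of bash (the script here is a made-up stand-in):

```shell
# Create a tiny script, then run it with tracing enabled.
cat > hello.sh <<'EOF'
NAME=World
echo Hello ${NAME} | wc -c
EOF

# -x prints every command, prefixed with +, before executing it.
bash -x hello.sh
```

The trace lines go to standard error, so the script's real output stays cleanly separated on standard output.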
ls
-bash: ls: command not found
You just broke the PATH. Wear it as a badge of honor! Look Ma! I broke my
system so thoroughly that not even ls works no more!
I did this a few times in my life - once while writing this book - when I was overly eager to use a meaningful name in an example script and forgot the rule. Solution: rename the variable to something other than PATH, for example:

FILE=foo.txt
Chapter 7

Annoying problems
Some of these problems are more common for those using Unix via Windows,
yet everyone will eventually need to troubleshoot similar situations.
Let’s turn on the visualization for both “whitespace” and “EOL” (end-of-line) characters. The setting to view whitespace depends on the editor; below we show the choices in Komodo Edit, found under the top-level menu item called View:
We see that the file actually has quite a bit of content. Spaces are dots .,
TABs are arrows ->, and LF is the new line (line feed) character.
cat Text-1.txt | wc
5 0 51
If this same file were in the so-called Windows text format it would look like this:
In a nutshell:

• Mac/Unix use the LF (line feed) character to mark the end of the line.
• Windows uses two characters, CR and LF (carriage return and line feed), to mark the end of the line. Thank you, Bill Gates, that is very helpful (sarcasm).

Make sure the file is in Unix format, and if not, convert it to Unix line endings.
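One portable way to do the conversion (dos2unix is another option, if it happens to be installed) is to delete the CR characters with tr; the file name below is a placeholder:

```shell
# Strip carriage returns: converts CRLF (Windows) endings to LF (Unix).
tr -d '\r' < windows.txt > unix.txt
```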
Part III

AWK PROGRAMMING
Chapter 8
Programming concepts
As soon as you start performing data analysis you will run into unanticipated
problems:
• Existing software tools can rarely do all steps.
• You may need to bridge small formatting differences with simple trans-
formations.
• You may need to compute values based on other values.
Often the transformations themselves are simple, yet impossible to do by hand due to the volume of the data. Not being able to get past these obstacles may bring your project to a halt; the ability to write programs becomes essential.
The good news is that the programming skills required to get past many
of the obstacles you’ll face are not particularly challenging to learn! Often
writing just a few lines or applying a formula is enough. You can get a lot of
mileage out of fundamental skills.
If you are new to the field, we would recommend starting with Python.

Chapter 9

Programming with awk

9.1 Why is awk popular with bioinformaticians?
actions are listed in the command. Once you know a bit of awk you immediately see that a few columns are selected, some columns are rearranged, and a third column is created by subtracting column 10 from column 11.

If the program were more complicated, putting it on one line might be unreadable; thus we typically only use awk when the programs we need to write are exceedingly simple.
Note how in the animated screencast I build the awk program from a smaller, simpler program, continuously refining it and adding more to it. Press the up-arrow to save typing the command again. When I am happy with the command I place it into a script.
As for the example above, while we can use cut to select columns we couldn’t
use it to rearrange them or to compute new or other values. Here is a
comparison of cut and awk:
# Matching ORFs and printing the start/ends.
cat SGD_features.tab | cut -f 2,4,10,11 | grep ORF | head
produces:
ORF YAL069W 335 649
ORF YAL068W-A 538 792
ORF YAL068C 2169 1807
ORF YAL067W-A 2480 2707
ORF YAL067C 9016 7235
...
Now with awk we can rearrange columns and even compute new ones (the difference between the end and start coordinates):
cat SGD_features.tab | awk -F '\t' ' $2=="ORF" { print $4, $2, $10, $11, $11-$10 } ' | head
to print:
YAL069W ORF 335 649 314
YAL068W-A ORF 538 792 254
YAL068C ORF 2169 1807 -362
YAL067W-A ORF 2480 2707 227
YAL067C ORF 9016 7235 -1781
...
The expression $2=="ORF" { print $4, $2, $10, $11, $11-$10 } is an
awk program.
With the default splitting behavior when we split the lines containing:
A B
A B
we end up with the same result: column 1 is A and column 2 is B for each line. In some cases, this is what we might have wanted.
In general, and especially when processing tabular data, we don’t want this
behavior. Imagine a tab-delimited file where tabs indicate columns. Say in
one row two empty columns follow each other. The default awk splitting
behavior would collapse these two tabs into one and subsequently shift the
rest of the columns by one, producing values from the wrong column. So
reading column 10 would sometimes give you the value in column 10 but
other times the value in column 11, 12, etc. Needless to say, this is highly
undesirable behavior.
In general, you always want to specify the splitting character when using
awk:
awk -F '\t'
The flag above sets the field separator to the tab character. If you are confident that your file cannot have consecutive tabs - for example, some file formats always require the presence of data, perhaps a zero or an empty marker - then you may not need to set the field separator.
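The difference is easy to demonstrate on a made-up line with an empty middle column:

```shell
# A tab-delimited line with an empty second column: A <tab> <tab> C
printf 'A\t\tC\n' > demo.txt

# Default splitting collapses the run of tabs: awk sees only two fields.
awk '{ print NF }' demo.txt
# prints: 2

# Splitting on single tabs preserves the empty column: three fields.
awk -F '\t' '{ print NF }' demo.txt
# prints: 3
```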
That being said, when there are more actions, these can be placed into a file like so:
CONDITION1 {
ACTIONS1
}
CONDITION2 {
ACTIONS2
}
and we can run the program through awk via the -f flag:
cat mydata.txt | awk -f myprogram.awk | head
The first number is the line number; the second number is how many columns awk thinks the line has. Look what happens if you split by tab characters:
cat SGD_features.tab | awk -F '\t' '{ print NR, NF }' | head
you will get:
1 16
2 16
3 16
4 16
5 16
6 16
...
Note how splitting with a different field separator makes awk see a different number of columns. Read the previous entry if you are unsure what is happening!
when you learn it for one language, the same concepts will apply for all others.
Here is a fancier printing example:
echo | awk '{ printf("A=%0.3f B=%d C=%03d",1,2,3) }'
will print:
A=1.000 B=2 C=003
See more printf examples in the GNU awk manual.
Chapter 10

Programming with bioawk

10.1 What is bioawk?
bioawk -c help
it will print:
bed:
1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam:
1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf:
1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff:
1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx:
1:name 2:seq 3:qual 4:comment
What this states is that when you are using the sam format you may use
variable names such as qname, flag etc.
For example, a command of this shape (reconstructed here) prints each read name:

cat SRR1972739_1.fastq | bioawk -c fastx ' { print $name } ' | head
SRR1972739.1
SRR1972739.2
SRR1972739.3
...
cat SRR1972739_1.fastq | bioawk -c fastx ' { print $name, gc($seq) } ' | head
Above we are making use of the bioawk-specific variables $name and $seq, as well as the bioawk-specific function gc (GC content), to produce:
SRR1972739.1 0.306931
SRR1972739.2 0.49505
SRR1972739.3 0.415842
SRR1972739.4 0.514851
wget http://data.biostarhandbook.com/bam/demo.bam
samtools view -f 2 demo.bam | awk '$6 ~ /D/ { print $6 }' | head
samtools view -f 2 demo.bam | bioawk -c sam '$cigar ~ /D/ { print $cigar }' | head
samtools view -f 2 demo.bam | awk '/NM:i:3/ { print $6, $12, $13, $14 }' | head
One really handy feature of awk is that it allows mutating a part of the line and printing it out in an otherwise unchanged format:

samtools depth demo.bam | awk ' $3 <= 3 { $1="LO" } $3 > 3 { $1="HI" } { print $0 }' | head
Note how we assign to $1 but then print $0. Here is what the code above
does:
LO 46 3
HI 47 4
HI 48 4
...
This seemingly odd feature is handy when the line has many columns, and we need to overwrite just one while still keeping the line in the same format.
Imagine that you had a file with 100 columns and wanted to change only one
column. It would be tedious and error-prone to piece back together the entire
line. Using the technique shown above you can mutate just one column then
print the line.
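Here is the same trick in a self-contained form; the input lines are made up to mimic samtools depth output:

```shell
# Mutate column 1 based on column 3, then print the whole line ($0).
printf 'chr1 46 3\nchr1 47 4\n' | \
    awk '$3 <= 3 { $1="LO" } $3 > 3 { $1="HI" } { print $0 }'
# prints:
# LO 46 3
# HI 47 4
```

Assigning to $1 causes awk to rebuild $0 from the fields, so the mutated column slots back into the line automatically.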
Chapter 11
Terminal multiplexing
For large datasets, the time required to complete the analyses becomes longer. It becomes essential to ensure that jobs do not get interrupted by closing a terminal, accidentally logging off, etc.

A terminal multiplexer is software designed to keep programs active even when you are not logged into the computer. You have several options to choose from; we recommend tmux or GNU Screen.

In our examples we will use the tmux terminal multiplexer.
You may also ignore the naming and just run tmux as a command to get a number like 0, 1, etc. automatically assigned as the name. In general, it is a good habit to name your sessions; later you could have a hard time figuring out which session contains what. Our command above created a session named foo that on my system is displayed like so:
date
Let’s demonstrate what tmux is used for. Press ctrl-b then the d key. The ctrl-b keypress is the so-called “prefix” that triggers tmux commands. In this case the d triggers the “detach” command. We now get the following view:

The previous session is still running in the background. We can now exit the terminal, log off this computer, or do anything else (other than shutting the computer down), and commands running in the session will remain active.
Note: the detaching will also happen automatically if the connection with
the terminal gets interrupted. To re-activate the previous session, we first
list all available sessions:
tmux ls
it prints:
We can see the session named foo here. Note that you may have multiple sessions; moreover, each session may have multiple windows within it (see later). In our case, we have one session and one window. To reattach the session to the current terminal type:

tmux attach -t foo
While I prefer to use window commands, some people like to see commands running side by side in so-called “panes”:
• ctrl-b, % split the screen in half from left to right
• ctrl-b, " split the screen in half from top to bottom
• ctrl-b, x kill the current pane
• ctrl-b, <arrow key> switch to the pane in whichever direction you
press
• ctrl-b, d detach from tmux, leaving everything running in the back-
ground
Here we have a window split into two panes:
Part IV

ADVANCED CONCEPTS
Chapter 12

Advanced shell concepts

12.1 Self documenting bash programs

Usage:

getdata.sh PRJNUM COUNT LIMIT

Parameters:

PRJNUM = SRA Bioproject number
COUNT = how many sequencing runs to download
LIMIT = how many reads to extract per sequencing run

Example:

getdata.sh PRJN2234 1000 5
To achieve this effect add the following to the beginning of the script:
if [ $# -eq 0 ]; then
echo
echo "*** This script needs arguments to work! ***"
echo
echo "Usage:"
echo " getdata.sh PRJNUM COUNT LIMIT"
echo
echo "Parameters:"
echo " PRJNUM = SRA Bioproject number"
echo " COUNT = how many sequencing runs to download"
echo " LIMIT = how many reads to extract per sequencing run"
echo
echo "Example:"
echo " getdata.sh PRJN2234 1000 5"
echo
exit 1
fi
12.2 Better output redirection

By default both stdout and stderr are directed to the terminal. For example, I might type:
ls * foo
it prints:
ls: foo: No such file or directory
A.txt
So we have a file called A.txt but nothing called foo. Next type:
ls * foo > B.txt
What will it print?
ls: foo: No such file or directory
Now see how the line A.txt is not printed anymore. A new file called B.txt was created that contains the file name, but we still see the error message. We have redirected the so-called standard output to a file, while the standard error is still printed to the terminal. To redirect the error stream use 2>:
ls * foo > B.txt 2> err.txt
It is the same as writing
ls * foo 1> B.txt 2> err.txt
but we almost never use 1> as long as > does the job.
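A related pattern worth knowing merges both streams into a single file; 2>&1 sends stderr to wherever stdout currently points:

```shell
# Capture the listing and the error message together in one file.
ls * foo > all.txt 2>&1
```

After this, all.txt contains both the names of the existing files and the error line about foo.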
To remove the directory from the name and keep only the file name, use the basename shell command like so:
FILE=/A/B/C.txt.gz
NAME=$(basename ${FILE})
echo $NAME
prints:
C.txt.gz
To chop off the rightmost extension use the %.* operator:
FILE=/A/B/C.txt.gz
CHOP=${FILE%.*}
echo $CHOP
it will print (note the .gz is missing):
/A/B/C.txt
Now to get only the extension gz use the ##*. operator like so:
FILE=/A/B/C.txt.gz
CHOP=${FILE##*.}
echo $CHOP
it prints:
gz
To remove everything up to and including the leftmost dot use the #*. operator:
FILE=/A/B/C.txt.gz
CHOP=${FILE#*.}
echo $CHOP
it prints:
txt.gz
If you need a transformation that is not offered as a single bash transforma-
tion, you can usually build it by successively applying separate instructions
in order.
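For example, to get the bare name with no directory and no extensions at all, the instructions can be combined (this sketch also uses the related %% operator, which removes the longest matching suffix):

```shell
FILE=/A/B/C.txt.gz
NAME=$(basename ${FILE})   # strips the directory -> C.txt.gz
BASE=${NAME%%.*}           # %% removes the longest .* suffix -> C
echo $BASE
# prints: C
```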
Printing to the screen allows you to first see what the commands will look like when executed. If they seem right, execute your commands by piping them to the bash command:

bash myscript.sh | bash
Here is, as an example, another script that takes a project id and a run count, and can be run as:

bash getproject.sh PRJNA257197 10

We want our script to fetch 10 sequencing datasets from the project PRJNA257197, then run a trimmomatic quality trim on each, and finally run a fastqc report on the resulting files.
Here is our script:
#
# Usage: getproject.sh PRJN NUM
#
# Example: bash getproject.sh PRJNA257197 3
#
# This program downloads a NUM number of experiments from an SRA Bioproject
# identified by the PRJN number then runs FastQC quality reports before
# and after trimming the data for quality.
esearch -db sra -query $PRJN | efetch -format runinfo > runinfo.csv
# We place an echo before each command to see what will get executed when it runs.
cat ids.csv | parallel echo fastq-dump --split-files -X ${LIMIT} {}
# Run FastQC on the new data that matches the *.fq extension.
cat ids.csv | parallel echo fastqc {}_1P.fq {}_2P.fq
Having a clear understanding of what the script does will help you troubleshoot any errors that crop up.
For more ideas, see the article Unofficial Bash Strict Mode.
Chapter 13

Mastery with Makefiles
Whereas writing scripts is useful and helps immensely, there may come a time when you want to keep all related functionality in one single file, yet only run a sub-section of it. A software tool named make and its corresponding Makefile are designed to do just that (and a lot more as well).
make will allow you to set up a veritable command and control center for
your data analysis. With Makefiles you will be able to keep track of what
you have done and how.
In our opinion, the path to reproducible research starts with Makefiles. Perhaps at some point in the future, when scientists submit their data analysis results, they will also be required to attach the Makefiles that describe the steps they took in a clear and explicit way.
Note
In this section we will only cover a tiny featureset of make. There
is a lot more to the tool than what we mention. The great thing
is that even using a sliver of make’s functionality can help you
immensely.
As you become more confident you may want to read more about
the tool and what it can do.
foo:
	echo Hello John!

bar:
	echo Hello Jane!
	echo Hello Everyone!
Note that in the above code you MUST have tabs in front of the commands for the makefile to work with make. If we visualized each tab character as --> it would look like so:
foo:
--> echo Hello John!
bar:
--> echo Hello Jane!
--> echo Hello Everyone!
This Makefile has two target rules: foo and bar. Once you create this file
you may type:
make foo
and it will print
Hello John!
or you can type:
make bar
it will look like this:

bar:
^Iecho Hello Jane!
^Iecho Hello Everyone!
One common error to check for is neglecting to put tabs in front of the commands. Many text editors insist on inserting space characters even when you press TAB rather than SPACE on your keyboard. View the whitespace in your editor!
echo "Done"

will always print “Done” even if the file is not present.
hello:
@echo Hello ${NAME}
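The @ prefix suppresses the echoing of the command itself, and the variable can be supplied when invoking make (a sketch; it assumes the hello rule above sits in your Makefile):

```shell
# Override the NAME variable from the command line:
make hello NAME=World
# prints: Hello World
```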
But remember - the simplest feature that you learned above already saves
you a lot of effort.
We have built beautiful, reproducible workflows in make without using any of the advanced features. Unless you use the advanced features a lot, you will tend to have to re-learn them - often not worth the time.
Part V

APPENDIX
Chapter 14
Appendix Notes
Chapter 15

Useful one-line scripts for awk
15.1 EXPLANATIONS
For detailed explanations of each pattern see:
• https://catonmat.net/awk-one-liners-explained-part-one
15.2 USAGE
awk '/pattern/ {print "$1"}' # standard Unix shells
15.3 FILE SPACING
# double space a file
awk 'BEGIN{ORS="\n\n"};1'
# double space a file which already has blank lines in it. Output file
# should contain no more than one blank line between lines of text.
# NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
# often treated as non-blank, and thus 'NF' alone will return TRUE.
awk 'NF{print $0 "\n"}'
15.4 NUMBERING AND CALCULATIONS

# precede each line by its line number FOR ALL FILES TOGETHER, with tab.
awk '{print NR "\t" $0}' files*
# number each line of file, but only print numbers if line is not blank
# Remember caveats about Unix treatment of \r (mentioned above)
awk 'NF{$0=++a " :" $0};{print}'
awk '{print (NF? ++a " :" :"") $0}'
# print every line after replacing each field with its absolute value
awk '{for (i=1; i<=NF; i++) if ($i < 0) $i = -$i; print }'
awk '{for (i=1; i<=NF; i++) $i = ($i < 0) ? -$i : $i; print }'
# print the largest first field and the line that contains it
# Intended for finding the longest string in field #1
awk '$1 > max {max=$1; maxline=$0}; END{ print max, maxline}'
# print every line where the value of the last field is > 4
awk '$NF > 4'
# substitute "foo" with "bar" ONLY for lines which contain "baz"
awk '/baz/{gsub(/foo/, "bar")};{print}'
# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")};{print}'
# print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'
# print the line immediately before a regex, but not the line
# containing the regex
awk '/regex/{print x};{x=$0}'
awk '/regex/{print (x=="" ? "match on line 1" : x)};{x=$0}'
# print the line immediately after a regex, but not the line
# containing the regex
awk '/regex/{getline;print}'
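The two patterns above can be checked on a three-line file (contents made up):

```shell
printf 'before\nMATCH\nafter\n' > ctx.txt
awk '/MATCH/{print x};{x=$0}' ctx.txt    # line before the match: prints "before"
awk '/MATCH/{getline;print}' ctx.txt     # line after the match: prints "after"
```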
# grep for AAA and BBB and CCC (in any order)
awk '/AAA/ && /BBB/ && /CCC/'
# grep for AAA and BBB and CCC (in that order)
awk '/AAA.*BBB.*CCC/'
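A quick check of both patterns (sample lines made up; note that the any-order AND behavior requires the `&&` form, since semicolon-separated patterns would match lines containing *any* of the three strings):

```shell
printf 'AAA BBB CCC\nCCC then AAA then BBB\nAAA BBB only\n' > t.txt
awk '/AAA/ && /BBB/ && /CCC/' t.txt   # any order: matches the first two lines
awk '/AAA.*BBB.*CCC/' t.txt           # in order: matches only the first line
```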
To fully exploit the power of awk, one must understand “regular expressions.”
For a detailed discussion of regular expressions, see “Mastering Regular
Expressions, 2nd edition” by Jeffrey Friedl (O’Reilly, 2002).
The manual (“man”) pages on Unix systems may be helpful (try “man awk”,
“man nawk”, “man regexp”, or the section on regular expressions in “man
ed”), but man pages are notoriously difficult. They are not written to teach
awk use or regexps to first-time users, but as a reference text for those already
acquainted with these tools.
USE OF '\t' IN awk SCRIPTS: For clarity in documentation, we have used the
expression '\t' to indicate a tab character (0x09) in the scripts. All versions
of awk, even the UNIX System 7 version, should recognize the '\t' abbreviation.
Chapter 16
Useful one-line scripts for sed
16.1 EXPLANATIONS
For detailed explanations of each pattern see:
• https://catonmat.net/sed-one-liners-explained-part-one
# double space a file which already has blank lines in it. Output file
# should contain no more than one blank line between lines of text.
sed '/^$/d;G'
# triple space a file
sed 'G;G'
# insert a blank line above and below every line which matches "regex"
sed '/regex/{x;p;x;G;}'
16.3 NUMBERING
# number each line of a file (simple left alignment). Using a tab (see
# note on '\t' at end of file) instead of space will preserve margins.
sed = filename | sed 'N;s/\n/\t/'
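For example (input made up), the first sed prints each line number on its own line, and the second joins each number to its line with a tab:

```shell
printf 'red\ngreen\n' > colors.txt
sed = colors.txt | sed 'N;s/\n/\t/'
```

The output is the two lines `1<tab>red` and `2<tab>green`.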
# number each line of file, but only print numbers if line is not blank
sed '/./=' filename | sed '/./N; s/\n/ /'
# substitute "foo" with "bar" ONLY for lines which contain "baz"
sed '/baz/s/foo/bar/g'
# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
sed '/baz/!s/foo/bar/g'
# add commas to numbers with decimal points and minus signs (GNU sed)
gsed -r ':a;s/(^|[^0-9.])([0-9]+)([0-9]{3})/\1\2,\3/g;ta'
# add a blank line every 5 lines (after lines 5, 10, 15, 20, etc.)
gsed '0~5G' # GNU sed only
sed 'n;n;n;n;G;' # other seds
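Running the portable version on the numbers 1 through 7 (generated with seq) shows the blank line inserted after line 5:

```shell
seq 7 | sed 'n;n;n;n;G;'
```

The output has eight lines: 1–5, a blank line, then 6 and 7.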
# print only lines which do NOT match regexp (emulates "grep -v")
sed -n '/regexp/!p' # method 1, corresponds to above
sed '/regexp/d' # method 2, simpler syntax
# print the line immediately before a regexp, but not the line
# containing the regexp
sed -n '/regexp/{g;1!p;};h'
# print the line immediately after a regexp, but not the line
# containing the regexp
sed -n '/regexp/{n;p;}'
# print 1 line of context before and after regexp, with line number
# indicating where the regexp occurred (similar to "grep -A1 -B1")
sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h
# grep for AAA and BBB and CCC (in any order)
sed '/AAA/!d; /BBB/!d; /CCC/!d'
# grep for AAA and BBB and CCC (in that order)
sed '/AAA.*BBB.*CCC/!d'
# print paragraph if it contains AAA and BBB and CCC (in any order)
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'
# print all of file EXCEPT section between 2 regular expressions
sed '/Iowa/,/Montana/d'
# delete ALL blank lines from a file (same as "grep '.' ")
sed '/^$/d' # method 1
sed '/./!d' # method 2
# delete all CONSECUTIVE blank lines from file except the first; also
# deletes all blank lines from top and end of file (emulates "cat -s")
sed '/./,/^$/!d' # method 1, allows 0 blanks at top, 1 at EOF
sed '/^$/N;/\n$/D' # method 2, allows 1 blank at top, 0 at EOF
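For example, method 2 squeezes a run of blank lines down to one (made-up input):

```shell
printf 'x\n\n\n\ny\n' > sq.txt
sed '/^$/N;/\n$/D' sq.txt
```

The three blank lines collapse to one, giving `x`, a blank line, then `y`.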
# delete all CONSECUTIVE blank lines from file except the first 2:
sed '/^$/N;/\n$/N;//D'
# parse out the address proper. Pulls out the e-mail address by itself
# from the 1-line return address header (see preceding script)
sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'
# add a leading angle bracket and space to each line (quote a message)
sed 's/^/> /'
# delete leading angle bracket & space from each line (unquote a message)
sed 's/^> //'
# zip up each .TXT file individually, deleting the source file and
# setting the name of each .ZIP file to the basename of the .TXT file
# (under DOS: the "dir /b" switch returns bare filenames in all caps).
echo @echo off >zipup.bat
dir /b *.txt | sed "s/^\(.*\)\.TXT/pkzip -mo \1 \1.TXT/" >>zipup.bat